datasets
Organization Account
datasets's Repositories

A growing and diverse dataset of text for AI to graze on and learn new information. Just like a pasture in the wild, it is a combination of sources. All the data is in Arrow format so it is easy to randomly access and stream.

43.8 GB
1201
Updated: 6 months ago

LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models with GPT-4-level vision/language capability.

13.3 GB
81K11
Updated: 6 months ago

Stores Wikipedia embeddings in a Parquet file.

71 B
1
Updated: 7 months ago
Public
3

Wikipedia dataset containing cleaned articles. There are 6.4 million articles that can be streamed from Apache Arrow files.

20.4 GB
165
Updated: 7 months ago
Public
1

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

316.2 MB
1
Updated: 7 months ago
Public
1

The NQ-Open task, introduced by Lee et al. (2019), is an open-domain question answering benchmark derived from Natural Questions. The goal is to predict an English answer string for an input English question. All questions can be answered using the contents of English Wikipedia.

Public
1

The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us.

459.1 kB
1
Updated: 7 months ago
Public
4
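
As a rough illustration of the problem layout described above (task description, code solution, 3 automated test cases), here is a minimal Python sketch. The `Problem` class, its field names, and the sample problem are assumptions invented for this example, not the dataset's actual schema or contents.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    """Hypothetical record layout; field names are illustrative only."""
    description: str   # natural-language task description
    solution: str      # reference solution as Python source code
    tests: List[str]   # assert statements, three per problem


def passes_all_tests(problem: Problem) -> bool:
    """Run the solution, then each test assertion, in one fresh namespace."""
    namespace: dict = {}
    try:
        exec(problem.solution, namespace)
        for test in problem.tests:
            exec(test, namespace)
    except Exception:
        # A failed assertion or any runtime error counts as a failure.
        return False
    return True


# Invented sample problem, not taken from the dataset.
example = Problem(
    description="Write a function to add two integers.",
    solution="def add(a, b):\n    return a + b",
    tests=[
        "assert add(1, 2) == 3",
        "assert add(0, 0) == 0",
        "assert add(-1, 1) == 0",
    ],
)

print(passes_all_tests(example))
```

Because `exec` runs arbitrary code, anything like this should only be used on trusted solutions or inside a sandbox.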

GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

Public
9

BoolQ is a question answering dataset for yes/no questions containing 15,942 examples. These questions are naturally occurring: they are generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. The text-pair classification setup is similar to existing natural language inference tasks.
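
To make the (question, passage, answer) triplet and the text-pair classification setup concrete, here is a minimal Python sketch. The `BoolQExample` class, the `to_text_pair` helper, and the sample record are hypothetical and invented for illustration; they are not the dataset's actual schema or data.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BoolQExample:
    """Hypothetical record layout; field names are illustrative only."""
    question: str
    passage: str
    answer: bool
    title: Optional[str] = None  # optional additional context


def to_text_pair(ex: BoolQExample) -> Tuple[str, str, int]:
    """Turn one example into (text_a, text_b, label) for a pair classifier."""
    # Prepend the page title to the passage when it is available.
    context = f"{ex.title}. {ex.passage}" if ex.title else ex.passage
    return ex.question, context, int(ex.answer)


# Invented sample record, not taken from the dataset.
sample = BoolQExample(
    question="is the example lake a freshwater lake",
    passage="The example lake is a freshwater lake fed by mountain streams.",
    answer=True,
    title="Example Lake",
)

text_a, text_b, label = to_text_pair(sample)
print(text_a)
print(text_b)
print(label)
```

The pair (question, passage) plays the same role as (hypothesis, premise) in a natural language inference model, with a binary label in place of the usual entailment classes.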

CommonsenseQA is a new multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers. The dataset is provided in two major training/validation/testing set splits: "Random split", which is the main evaluation split, and "Question token split"; see the paper for details.