datasets
datasets's Repositories
Goal: Make a LLM's little brain go poof 🤯 Evaluate LLM vs LLM with help from my buddies at Oxen.ai.This is a dataset of unanswerable questions to test whether an LLM "knows when it does not know" and minimize hallucinations.
CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations.
The SMS Spam Collection is a set of SMS tagged messages that have been collected for SMS Spam research. It contains one set of SMS messages in English of 5,574 messages, tagged according being ham (legitimate) or spam. The original data can be found here: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
This is a list set of example datasets to help you go from an LLM data zero 🤔 to a LLM data hero 🦸.
Sampled version of cerebras/SlimPajama-627B. Since the original data was shuffled before chunking, it contains train/chunk1 (of 10 total) and further sampled 10%. This should result in roughly 6B tokens, hence SlimPajama-6B.