AI Datasets
Curated datasets for LLM training, fine-tuning, and evaluation.
26 datasets
Common Crawl
Featured · Petabyte-scale web crawl corpus updated monthly. The foundation dataset behind many LLM training pipelines.
The Pile
Featured · An 825GB diverse open-source language modeling dataset curated by EleutherAI from 22 high-quality subsets.
RedPajama v2
A 30T token web dataset with quality signals for LLM pretraining, created by Together AI.
FineWeb
Featured · 15T token dataset of cleaned and deduplicated English web data from HuggingFace; models pretrained on it outperform those trained on other open web datasets.
The Stack v2
Featured · The largest open code dataset: 67.5TB from Software Heritage covering 619 programming languages.
StarCoder Data
Curated code pretraining data used to train StarCoder models. 783GB across 86 programming languages.
CodeSearchNet
Code–natural language pairs in 6 languages (Go, Java, JavaScript, PHP, Python, Ruby), pairing functions with their docstrings.
LAION-5B
Featured · 5.85 billion image-text pairs scraped from the web. One of the largest open multimodal datasets.
ShareGPT4V
100K high-quality image-text pairs with detailed captions generated by GPT-4V, for vision-language training.
LLaVA-Instruct-150K
150K visual instruction tuning samples generated by GPT-4, used to train LLaVA vision-language models.
OpenAssistant Conversations
Featured · 161K human-annotated messages in 35 languages, organized into conversation trees. High-quality RLHF training data.
Stanford Alpaca
52K instruction-following samples generated with OpenAI's text-davinci-003 (GPT-3.5). Popularized the self-instruct approach to fine-tuning.
UltraChat 200K
200K multi-turn dialogues covering a wide range of topics, filtered for quality from the full 1.5M set.
WildChat
1M real user-ChatGPT conversations with metadata. Captures diverse real-world usage patterns.
LMSYS-Chat-1M
Featured · 1M real conversations from Chatbot Arena covering 25+ LLMs with user preferences and metadata.
MMLU
Featured · Massive Multitask Language Understanding: 57 subjects from STEM to the humanities. The standard LLM benchmark.
HellaSwag
Commonsense NLI benchmark — choose the most plausible continuation among 4 options. Tests everyday reasoning.
HumanEval
164 hand-written Python programming problems by OpenAI. The standard benchmark for code generation.
GSM8K
8.5K grade-school math problems with step-by-step solutions. Key benchmark for mathematical reasoning.
MT-Bench
80 multi-turn questions judged by GPT-4. Tests instruction following across 8 categories.
AlpacaEval
Automatic LLM evaluation benchmark using GPT-4 as judge. 805 instructions from diverse sources.
Wikipedia
Featured · Full Wikipedia dumps in 300+ languages. Commonly used for pretraining and knowledge grounding.
C4 (Colossal Clean Crawled Corpus)
~750GB cleaned English web text from Common Crawl. Used to train T5 and many other models.
ARC (AI2 Reasoning Challenge)
7.7K grade-school science questions in Easy and Challenge sets. Tests science reasoning and world knowledge.
MATH
Featured · 12.5K competition-level math problems with step-by-step LaTeX solutions, spanning 7 subjects and 5 difficulty levels.
TheoremQA
800 theorem-driven questions across math, physics, CS, and finance requiring multi-step reasoning.
FAQ
How are these datasets selected?
Each dataset is selected based on community adoption, citation count, data quality, and practical value for LLM training, fine-tuning, or evaluation. We prioritize openly licensed datasets hosted on HuggingFace.
Can I use these datasets commercially?
Licensing varies by dataset; check the license field on each card. Datasets under MIT, Apache 2.0, or CC-BY licenses are generally safe for commercial use, while some, like Alpaca (CC-BY-NC), restrict commercial use.
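You can also check the declared license programmatically from a dataset's HuggingFace card metadata. A minimal sketch using the huggingface_hub client (the repo IDs below are illustrative, and it assumes the card declares a license field):

```python
# Sketch: read the declared license from a dataset's HuggingFace card
# metadata. Assumes the card sets a `license` field; always confirm
# against the full dataset card and upstream terms before commercial use.
from huggingface_hub import HfApi

api = HfApi()
for repo_id in ["tatsu-lab/alpaca", "openai/gsm8k"]:  # illustrative repo IDs
    info = api.dataset_info(repo_id)
    license_tag = info.card_data.get("license") if info.card_data else None
    print(f"{repo_id}: {license_tag}")
```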
What formats are commonly used?
Most modern AI datasets use Parquet (columnar, efficient) or JSONL (one JSON object per line). The HuggingFace Datasets library can load either format with a single line of code.
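For instance, a minimal sketch of both cases (the local file names are placeholders; cais/mmlu is the public MMLU repo on the Hub):

```python
# Loading the two common formats, plus a dataset straight from the Hub,
# with the `datasets` library. File names here are placeholders.
from datasets import load_dataset

# JSONL: one JSON object per line
jsonl_ds = load_dataset("json", data_files="train.jsonl", split="train")

# Parquet: columnar and compressed, efficient for large corpora
parquet_ds = load_dataset("parquet", data_files="train.parquet", split="train")

# Hub datasets load by repo ID (and a config name, where one exists)
mmlu = load_dataset("cais/mmlu", "all", split="test")

print(jsonl_ds[0])
```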