AI Datasets
Curated datasets for LLM training, fine-tuning, and evaluation.
26 datasets
Common Crawl
Featured · Petabyte-scale web crawl corpus updated monthly. The foundation dataset behind many LLM training pipelines.
The Pile
Featured · An 825GB diverse open-source language modeling dataset curated by EleutherAI from 22 high-quality subsets.
RedPajama v2
A 30T token web dataset with quality signals for LLM pretraining, created by Together AI.
FineWeb
Featured · 15T token dataset of cleaned and deduplicated English web data from HuggingFace; models pretrained on it outperform those trained on other open web datasets.
The Stack v2
Featured · The largest open code dataset: 67.5TB from Software Heritage covering 619 programming languages.
StarCoder Data
Curated code pretraining data used to train StarCoder models. 783GB across 86 programming languages.
CodeSearchNet
Code–natural language pairs in 6 languages (Go, Java, JavaScript, PHP, Python, Ruby), pairing functions with their docstrings.
LAION-5B
Featured · 5.85 billion image-text pairs scraped from the web. One of the largest open multimodal datasets.
ShareGPT4V
100K high-quality image-text pairs with detailed captions generated by GPT-4V, for vision-language training.
LLaVA-Instruct-150K
150K visual instruction tuning samples generated by GPT-4, used to train LLaVA vision-language models.
OpenAssistant Conversations
Featured · 161K human-annotated messages in 35 languages, organized into conversation trees. High-quality RLHF training data.
Stanford Alpaca
52K instruction-following samples generated with OpenAI's text-davinci-003 (GPT-3.5). Popularized the self-instruct approach to fine-tuning.
UltraChat 200K
200K multi-turn dialogues covering a wide range of topics, filtered for quality from the full 1.5M set.
WildChat
1M real user-ChatGPT conversations with metadata. Captures diverse real-world usage patterns.
LMSYS-Chat-1M
Featured · 1M real conversations from Chatbot Arena covering 25+ LLMs with user preferences and metadata.
MMLU
Featured · Massive Multitask Language Understanding: 57 subjects from STEM to the humanities. The standard LLM benchmark.
HellaSwag
Commonsense NLI benchmark — choose the most plausible continuation among 4 options. Tests everyday reasoning.
HumanEval
164 hand-written Python programming problems by OpenAI. The standard benchmark for code generation.
GSM8K
8.5K grade-school math problems with step-by-step solutions. Key benchmark for mathematical reasoning.
MT-Bench
80 multi-turn questions judged by GPT-4. Tests instruction following across 8 categories.
AlpacaEval
Automatic LLM evaluation benchmark using GPT-4 as judge. 805 instructions from diverse sources.
Wikipedia
Featured · Full Wikipedia dumps in 300+ languages. Commonly used for pretraining and knowledge grounding.
C4 (Colossal Clean Crawled Corpus)
~750GB cleaned English web text from Common Crawl. Used to train T5 and many other models.
ARC (AI2 Reasoning Challenge)
7.7K grade-school science questions in Easy and Challenge sets. Tests science reasoning and world knowledge.
MATH
Featured · 12.5K competition-level math problems with step-by-step LaTeX solutions, spanning 7 subjects and 5 difficulty levels.
TheoremQA
800 theorem-driven questions across math, physics, CS, and finance requiring multi-step reasoning.
FAQ
How are these datasets selected?
Each dataset is selected based on community adoption, citation count, data quality, and practical value for LLM training, fine-tuning, or evaluation. We prioritize openly licensed datasets hosted on HuggingFace.
Can I use these datasets commercially?
Licensing varies by dataset; check the license field on each card. Datasets under MIT, Apache 2.0, or CC-BY licenses are generally safe for commercial use, while some, like Alpaca (CC-BY-NC), restrict commercial use.
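You can also check the declared license programmatically from a dataset's HuggingFace card metadata. A minimal sketch using the huggingface_hub client (the repo IDs below are illustrative, and it assumes the card declares a license field):

```python
# Sketch: read the declared license from a dataset's HuggingFace card
# metadata. Assumes the card sets a `license` field; always confirm
# against the full dataset card and upstream terms before commercial use.
from huggingface_hub import HfApi

api = HfApi()
for repo_id in ["tatsu-lab/alpaca", "openai/gsm8k"]:  # illustrative repo IDs
    info = api.dataset_info(repo_id)
    license_tag = info.card_data.get("license") if info.card_data else None
    print(f"{repo_id}: {license_tag}")
```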
What formats are commonly used?
Most modern AI datasets use Parquet (columnar, efficient) or JSONL (one JSON object per line). The HuggingFace Datasets library can load either format with a single line of code.
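For instance, a minimal sketch of both cases (the local file names are placeholders; cais/mmlu is the public MMLU repo on the Hub):

```python
# Loading the two common formats, plus a dataset straight from the Hub,
# with the `datasets` library. File names here are placeholders.
from datasets import load_dataset

# JSONL: one JSON object per line
jsonl_ds = load_dataset("json", data_files="train.jsonl", split="train")

# Parquet: columnar and compressed, efficient for large corpora
parquet_ds = load_dataset("parquet", data_files="train.parquet", split="train")

# Hub datasets load by repo ID (and a config name, where one exists)
mmlu = load_dataset("cais/mmlu", "all", split="test")

print(jsonl_ds[0])
```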