DevTk.AI

AI Datasets

Curated datasets for LLM training, fine-tuning, and evaluation.

26 datasets

Common Crawl

Featured
NLP

Petabyte-scale web crawl corpus updated monthly. The foundation dataset behind many LLM training pipelines.

> 1 PB · WARC / WET
CC0 (public domain)

The Pile

Featured
NLP

An 800GB diverse open-source language modeling dataset curated by EleutherAI from 22 high-quality subsets.

800 GB · JSONL
MIT

RedPajama v2

NLP

A 30T token web dataset with quality signals for LLM pretraining, created by Together AI.

30T tokens · Parquet / JSONL
Apache 2.0

FineWeb

Featured
NLP

15T-token dataset of cleaned and deduplicated English web data from HuggingFace; models pretrained on it outperform those trained on other open web datasets.

15T tokens · Parquet
ODC-By

The Stack v2

Featured
Code

The largest open code dataset — 67.5TB from Software Heritage covering 619 programming languages.

67.5 TB · Parquet
Various (per-file)

StarCoder Data

Code

Curated code pretraining data used to train StarCoder models. 783GB across 86 programming languages.

783 GB · Parquet
Various (per-file)

CodeSearchNet

Code

Code-natural language pairs for 6 languages (Go, Java, JS, PHP, Python, Ruby) with function docstrings.

2 GB · JSONL
MIT

LAION-5B

Featured
Multimodal

5.85 billion image-text pairs scraped from the web. One of the largest open multimodal datasets.

240 TB (images) · Parquet + URLs
CC-BY 4.0 (metadata)

ShareGPT4V

Multimodal

100K high-quality image-text pairs with detailed GPT-4V-generated captions for vision-language training.

~50 GB · JSON + Images
CC-BY-NC 4.0

LLaVA-Instruct-150K

Multimodal

150K visual instruction tuning samples generated by GPT-4, used to train LLaVA vision-language models.

~300 MB · JSON
CC-BY-NC 4.0

OpenAssistant Conversations

Featured
Instruction Tuning

161K human-annotated messages in 35+ languages, organized into assistant conversation trees. High-quality RLHF training data.

~400 MB · Parquet
Apache 2.0

Stanford Alpaca

Instruction Tuning

52K instruction-following samples generated by GPT-3.5. Pioneered the self-instruct fine-tuning approach.

~50 MB · JSON
CC-BY-NC 4.0

UltraChat 200K

Instruction Tuning

200K multi-turn dialogues covering a wide range of topics, filtered for quality from the full 1.5M set.

~500 MB · Parquet
MIT

WildChat

Instruction Tuning

1M real user-ChatGPT conversations with metadata. Captures diverse real-world usage patterns.

~5 GB · Parquet
ODC-By

LMSYS-Chat-1M

Featured
Instruction Tuning

1M real conversations from Chatbot Arena covering 25+ LLMs with user preferences and metadata.

~3 GB · Parquet
CC-BY-NC 4.0

MMLU

Featured
Evaluation

Massive Multitask Language Understanding — 57 subjects from STEM to humanities. The standard LLM benchmark.

~300 MB · Parquet
MIT

HellaSwag

Evaluation

Commonsense NLI benchmark — choose the most plausible continuation among 4 options. Tests everyday reasoning.

~70 MB · Parquet
MIT

HumanEval

Evaluation

164 hand-written Python programming problems by OpenAI. The standard benchmark for code generation.

~1 MB · JSONL
MIT
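
HumanEval scores a model by running its completion against each task's hidden unit tests. As a minimal sketch, the snippet below uses a made-up toy task (the real dataset uses the same field names: `prompt`, `canonical_solution`, `test`, `entry_point`) and checks the reference solution the way a completion would be checked:

```python
# Hypothetical HumanEval-style record; the toy "add" task is invented
# for illustration, but the field names match the real dataset schema.
task = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "canonical_solution": "    return a + b\n",
    "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
    "entry_point": "add",
}

# A completion passes if prompt + completion satisfies the task's tests.
ns = {}
exec(task["prompt"] + task["canonical_solution"], ns)  # define the function
exec(task["test"], ns)                                 # define check()
ns["check"](ns[task["entry_point"]])                   # raises on failure
print("passed")
```

Real harnesses run each candidate in a sandboxed subprocess rather than a bare `exec`, since model output is untrusted code.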

GSM8K

Evaluation

8.5K grade-school math problems with step-by-step solutions. Key benchmark for mathematical reasoning.

~10 MB · Parquet
MIT

MT-Bench

Evaluation

80 multi-turn questions judged by GPT-4. Tests instruction following across 8 categories.

~5 MB · JSON
CC-BY 4.0

AlpacaEval

Evaluation

Automatic LLM evaluation benchmark using GPT-4 as judge. 805 instructions from diverse sources.

~5 MB · JSON
Apache 2.0

Wikipedia

Featured
Knowledge

Full Wikipedia dumps in 300+ languages. Commonly used for pretraining and knowledge grounding.

~21 GB (en) · Parquet
CC-BY-SA 3.0

C4 (Colossal Clean Crawled Corpus)

Knowledge

~750GB cleaned English web text from Common Crawl. Used to train T5 and many other models.

~750 GB · Parquet / JSONL
ODC-By

ARC (AI2 Reasoning Challenge)

Reasoning

7.7K grade-school science questions in Easy and Challenge sets. Tests science reasoning and world knowledge.

~3 MB · Parquet
CC-BY-SA 4.0

MATH

Featured
Reasoning

12.5K competition-level math problems with step-by-step LaTeX solutions, spanning 7 subjects and 5 difficulty levels.

~60 MB · JSONL
MIT

TheoremQA

Reasoning

800 theorem-driven questions across math, physics, CS, and finance requiring multi-step reasoning.

~5 MB · JSON
MIT

FAQ

How are these datasets selected?

Each dataset is selected based on community adoption, citation count, data quality, and practical value for LLM training, fine-tuning, or evaluation. We prioritize openly licensed datasets hosted on the HuggingFace Hub.

Can I use these datasets commercially?

Licensing varies by dataset; check the license field on each card. Datasets under MIT, Apache 2.0, or CC-BY licenses are generally safe for commercial use. Some, like Alpaca, use CC-BY-NC, which restricts commercial use.

What formats are commonly used?

Most modern AI datasets use Parquet (columnar, efficient) or JSONL (one JSON object per line). The HuggingFace Datasets library can load either format with a single line of code.
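
As a minimal sketch of the JSONL format, the snippet below parses two invented instruction-style records (the field names are illustrative, not from any specific dataset) using only the standard library:

```python
import io
import json

# JSONL: one self-contained JSON object per line.
raw = (
    '{"instruction": "Define JSONL.", "output": "One JSON object per line."}\n'
    '{"instruction": "Name a columnar format.", "output": "Parquet."}\n'
)

# Parsing is just line-by-line json.loads; no special library needed.
records = [json.loads(line) for line in io.StringIO(raw) if line.strip()]

print(len(records))           # 2
print(records[0]["output"])   # One JSON object per line.
```

With the HuggingFace library, the one-liner equivalent for a file on disk would be `load_dataset("json", data_files="data.jsonl")`, and Parquet loads the same way with the `"parquet"` builder.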