RedPajama: Reproducing the LLaMA🦙 Dataset on 1.2 Trillion Tokens
May 9, 2023
The llama in the pajama picture is just too cute not to use.
What is RedPajama?
RedPajama is a project that has reproduced the LLaMA training dataset, with over 1.2 trillion tokens.
More importantly, the project is making the dataset open.
Beyond that, it aims to create a set of leading, fully open-source models.
The first step is to create the training set.
What’s in the data?
“The RedPajama base dataset is a 1.2 trillion token fully-open dataset created by following the recipe described in the LLaMA paper.”
The following shows the composition of the dataset compared with LLaMA:
Dataset Structure
The dataset structure is as follows:
{
"text": ...,
"meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
"red_pajama_subset": "common_crawl" | "c4" | "github" | "books" | "arxiv" | "wikipedia" | "stackexchange"
}
Impact