RedPajama Reproducing LLaMA🦙 Dataset on 1.2 Trillion Tokens

Angelina Yang
2 min read · May 9, 2023

The llama in the pajama picture is just too cute not to use.

Source: Together

What is RedPajama?

RedPajama is a project that recreated the LLaMA training dataset of over 1.2 trillion tokens.

More importantly, they are making the dataset open.

Beyond that, they aim to create a set of leading, fully open-source models.

The first step is to create the training set.

What’s in the data?

“The RedPajama base dataset is a 1.2 trillion token fully-open dataset created by following the recipe described in the LLaMA paper.”

The following compares the composition of the dataset with that of LLaMA:

Source: Estimated comp

Dataset Structure

The dataset structure is as follows:

{
"text": ...,
"meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
"red_pajama_subset": "common_crawl" | "c4" | "github" | "books" | "arxiv" | "wikipedia" | "stackexchange"
}
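
To make the structure above a bit more concrete, here is a minimal Python sketch that checks a record for the three top-level fields and tallies records by subset. It assumes each record arrives as one JSON object per line; the helper names (`parse_record`, `count_by_subset`) and the sample records are made up for illustration and are not part of the RedPajama tooling.

```python
import json
from collections import Counter

# Subset names copied from the structure shown above.
SUBSETS = {
    "common_crawl", "c4", "github", "books",
    "arxiv", "wikipedia", "stackexchange",
}
REQUIRED_FIELDS = {"text", "meta", "red_pajama_subset"}


def parse_record(line):
    """Parse one JSON line and check it matches the documented structure.

    Hypothetical helper: assumes one JSON object per line.
    """
    record = json.loads(line)
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if record["red_pajama_subset"] not in SUBSETS:
        raise ValueError(f"unknown subset: {record['red_pajama_subset']!r}")
    return record


def count_by_subset(lines):
    """Count how many valid records belong to each subset."""
    return Counter(parse_record(line)["red_pajama_subset"] for line in lines)


# Made-up sample records, for illustration only.
sample_lines = [
    json.dumps({"text": "first doc", "meta": {"language": "en"},
                "red_pajama_subset": "c4"}),
    json.dumps({"text": "second doc", "meta": {"language": "en"},
                "red_pajama_subset": "c4"}),
    json.dumps({"text": "a paper", "meta": {"language": "en"},
                "red_pajama_subset": "arxiv"}),
]

counts = count_by_subset(sample_lines)
print(counts)
```

Failing loudly on an unexpected subset name is a design choice here: it surfaces schema drift early instead of silently miscounting.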

Impact
