RedPajama: Reproducing the LLaMA🦙 Dataset on 1.2 Trillion Tokens
--
The llama in the pajama picture is just too cute not to use.
What is RedPajama?
RedPajama is a project that has recreated the LLaMA training dataset of more than 1.2 trillion tokens.
More importantly, the dataset is fully open.
Beyond the data, the project aims to build a set of leading, fully open-source models, and creating the training set is the first step.
What’s in the data?
“The RedPajama base dataset is a 1.2 trillion token fully-open dataset created by following the recipe described in the LLaMA paper.”
The dataset draws from seven sources (Common Crawl, C4, GitHub, books, arXiv, Wikipedia, and Stack Exchange), mirroring the data mix described in the LLaMA paper.
Dataset Structure
The dataset structure is as follows:
{
  "text": ...,
  "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
  "red_pajama_subset": "common_crawl" | "c4" | "github" | "books" | "arxiv" | "wikipedia" | "stackexchange"
}
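To make the schema concrete, here is a minimal sketch of reading one record from a downloaded jsonl file with Python's standard json module (the file name below is hypothetical; the real files are organized by subset on Hugging Face):

import json

# Hypothetical local file; each line of a RedPajama jsonl file is one JSON record.
with open("arxiv_sample.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record["red_pajama_subset"])  # e.g. "arxiv"
        print(record["meta"])               # url, timestamp, source, language, ...
        print(record["text"][:200])         # first 200 characters of the document text
        break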
Impact
Fully open base datasets like this make foundation LLMs more accessible to the broader community, not just AI researchers but anyone with a use case who is looking for the AI "hammer".
Commercial entities that own the best foundation models may find it harder to keep their LLMs as a "moat" for long.
This trend has already played out with Stable Diffusion, which showed that open-source models can rival the quality of commercial offerings like DALL·E and can unlock remarkable creativity through community participation.
How to get the data?
The dataset consists of 2,084 jsonl files. You can download it with the Hugging Face datasets library:
from datasets import load_dataset
ds = load_dataset("togethercomputer/RedPajama-Data-1T")
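The full dataset is several terabytes, so if you just want to explore it, one option is to stream a single subset instead of downloading everything. This is a sketch only; the subset name "arxiv" is an assumption based on the red_pajama_subset values shown earlier, so check the dataset card for the exact configuration names:

from datasets import load_dataset

# Stream one subset so nothing has to be downloaded to disk up front.
# "arxiv" is an assumed configuration name; see the dataset card for exact names.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",
    split="train",
    streaming=True,
)

# Each streamed example follows the structure shown in the Dataset Structure section.
sample = next(iter(ds))
print(sample["red_pajama_subset"], sample["meta"])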
Things are getting exciting!
Happy practicing!
Thanks for reading my newsletter. You can follow me on LinkedIn or Twitter @Angelina_Magr and Substack!
Good reads:
- LLaMA paper: LLaMA: Open and Efficient Foundation Language Models
- Medium blog: Stable Diffusion: Best Open Source Version of DALL·E 2
- Facebook research blog: Introducing LLaMA: A foundational, 65-billion-parameter large language model
- RedPajama GitHub