RedPajama: Reproducing the LLaMA🦙 Dataset on 1.2 Trillion Tokens
--
The llama in the pajama picture is just too cute not to use.
What is RedPajama?
RedPajama is a project that has recreated the LLaMA training dataset of more than 1.2 trillion tokens.
More importantly, the dataset is fully open.
Beyond the data, the project aims to build a set of leading, fully open-source models, and creating the training set is the first step.
What’s in the data?
“The RedPajama base dataset is a 1.2 trillion token fully-open dataset created by following the recipe described in the LLaMA paper.”
The dataset draws from seven sources (Common Crawl, C4, GitHub, books, arXiv, Wikipedia, and Stack Exchange), mirroring the data mix described in the LLaMA paper.
Dataset Structure
The dataset structure is as follows:
{
  "text": ...,
  "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
  "red_pajama_subset": "common_crawl" | "c4" | "github" | "books" | "arxiv" | "wikipedia" | "stackexchange"
}
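To make the schema concrete, here is a minimal sketch of reading one record from a downloaded jsonl file with Python's standard json module (the file name below is hypothetical; the real files are organized by subset on Hugging Face):

import json

# Hypothetical local file; each line of a RedPajama jsonl file is one JSON record.
with open("arxiv_sample.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record["red_pajama_subset"])  # e.g. "arxiv"
        print(record["meta"])               # url, timestamp, source, language, ...
        print(record["text"][:200])         # first 200 characters of the document text
        break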
Impact
Fully open base datasets like this make foundation LLMs more accessible to the broader community, not just AI researchers but anyone with a use case who is looking for the AI "hammer".
Commercial entities that own the best foundation models may find it harder to keep their LLMs as a "moat" for long.
This trend has already played out with Stable Diffusion, which showed that open-source models can rival the quality of commercial offerings like DALL·E and can unlock remarkable creativity through community participation.
How to get the data?
The dataset consists of 2,084 jsonl files. You can download it with the Hugging Face datasets library:
from datasets import load_dataset
ds = load_dataset("togethercomputer/RedPajama-Data-1T")
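The full dataset is several terabytes, so if you just want to explore it, one option is to stream a single subset instead of downloading everything. This is a sketch only; the subset name "arxiv" is an assumption based on the red_pajama_subset values shown earlier, so check the dataset card for the exact configuration names:

from datasets import load_dataset

# Stream one subset so nothing has to be downloaded to disk up front.
# "arxiv" is an assumed configuration name; see the dataset card for exact names.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",
    split="train",
    streaming=True,
)

# Each streamed example follows the structure shown in the Dataset Structure section.
sample = next(iter(ds))
print(sample["red_pajama_subset"], sample["meta"])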
Things are getting exciting!
Happy practicing!
Thanks for reading my newsletter. You can follow me on LinkedIn or Twitter @Angelina_Magr and Substack!
Good reads:
- LLaMA paper: LLaMA: Open and Efficient Foundation Language Models
- Medium blog: Stable Diffusion: Best Open Source Version of DALL·E 2
- Facebook research blog: Introducing LLaMA: A foundational, 65-billion-parameter large language model
- RedPajama GitHub