RedPajama: Reproducing the LLaMA 🦙 Dataset on 1.2 Trillion Tokens

Angelina Yang
May 9, 2023

The llama in the pajama picture is just too cute not to use.

Source: Together

What is RedPajama?

RedPajama is a project that has recreated the LLaMA training dataset, which contains over 1.2 trillion tokens.

More importantly, the project is making the dataset openly available.