RedPajama: Reproducing the LLaMA 🦙 Dataset on 1.2 Trillion Tokens

Angelina Yang
2 min read · May 9, 2023

What is RedPajama?

RedPajama is a project that reproduces the LLaMA training dataset, which contains over 1.2 trillion tokens.

More importantly, the project is making the dataset openly available.