togethercomputer/RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

Languages: Python, Shell, Makefile

This page shows stars and forks stats for the togethercomputer/RedPajama-Data repository. As of 20 Apr 2024, the repository has 3,520 stars and 272 forks.

RedPajama-Data: An Open Source Recipe to Reproduce LLaMA training dataset

This repo contains a reproducible data recipe for the RedPajama data, with the following token counts:

Dataset        Token Count
Commoncrawl    878 Billion
C4             175 Billion
GitHub         59 Billion
Books          26 Billion
ArXiv          28 Billion
Wikipedia      24 Billion
StackExchange  20 Billion
Total          1.2 Trillion

Data Preparation
In data_prep, we provide all pre-processing scripts and guidelines.

Tokenization
In tokenization, we provide an example of how to...
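At its core, counting tokens for a dataset means running each document's text through a tokenizer and summing the resulting lengths. The following is a minimal, hypothetical sketch of that idea, not the repository's actual tokenization script: it assumes documents are stored as JSONL shards with a "text" field and uses the Hugging Face transformers GPT-NeoX tokenizer purely for illustration.

import json
import sys

from transformers import AutoTokenizer

# Assumption: JSONL shards with a "text" field per document; the tokenizer
# choice below is illustrative and may differ from what the repo uses.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")


def count_tokens(jsonl_path: str) -> int:
    """Return the total number of tokens across all documents in one shard."""
    total = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            ids = tokenizer(doc["text"], add_special_tokens=False)["input_ids"]
            total += len(ids)
    return total


if __name__ == "__main__":
    # Usage: python count_tokens.py shard1.jsonl shard2.jsonl ...
    for path in sys.argv[1:]:
        print(f"{path}\t{count_tokens(path)} tokens")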
Repo | Techs | Stars | Stars (weekly) | Forks | Forks (weekly)
zilliztech/GPTCache | Python, Other | 5.3k | 0 | 362 | 0
wjz304/arpl-i18n | Shell, C, Other | 2.3k | 0 | 369 | 0
aleskxyz/reality-ezpz | Shell, Python | 778 | 0 | 109 | 0
PatrickAlphaC/foundry-smart-contract-lottery-f23 | Solidity, Makefile | 26 | 0 | 5 | 0
pashpashpash/vault-ai | JavaScript, Go, Less | 3.1k | 0 | 298 | 0
nvim-neotest/neotest-plenary | Lua, Shell | 21 | 0 | 5 | 0
kronosnet/knet-ci-test | M4, Shell, Makefile | 0 | 0 | 4 | 0
ricardoerikson/makefile-latex | Makefile, Shell | 0 | 0 | 0 | 0
kaqijiang/Auto-GPT-ZH | Python, Other | 2.3k | 0 | 401 | 0
haotian-liu/LLaVA | Python, Shell, JavaScript | 7.6k | 0 | 651 | 0