
The raw Common Crawl has 100 trillion tokens, admittedly with some duplication. RedPajama has 30T after deduplication. That’s most of the way there, before including PDFs and Alibaba’s other data sources. (Does Common Crawl include Chinese pages? Edit: Yes.)

