bionhoward | 6 months ago | on: Qwen3: Think deeper, act faster
The raw Common Crawl has 100 trillion tokens, admittedly with some duplication. RedPajama has 30T deduplicated. That's most of the way there, even before including PDFs and Alibaba's other data sources. (Does Common Crawl include Chinese pages? Edit: Yes.)