
I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it [P]

Our take

After years of dedicated work, I am excited to share my extensive 103.1 billion-token Usenet corpus, encompassing a complete archive from 1980 to 2013. This dataset features 408 million posts across 18,347 newsgroups, reflecting 33 years of language evolution. The processing pipeline ensured meticulous cleaning and organization, including deduplication and language detection, resulting in a corpus that is 96.6% English. This rich historical resource offers unique insights into the evolution of online discourse.

For the past several years I've been quietly assembling and processing what I believe is one of the larger privately held pretraining corpora around... a complete Usenet archive spanning 1980 to 2013.

Here's what it ended up being:

  • 103.1 billion tokens (cl100k_base; a token-counting sketch follows this list)
  • 408 million posts across 9 newsgroup hierarchies
  • 18,347 newsgroups covered
  • 33 years of continuous coverage
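
For anyone who wants to reproduce a count like that, a rough sketch with tiktoken is below; the gzip JSONL layout and the "text" field name are illustrative assumptions, not the actual accounting code used for this corpus.

```python
# Rough sketch of counting cl100k_base tokens over one gzip JSONL shard.
# File layout and the "text" field name are assumptions for illustration.
import gzip
import json

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(jsonl_gz_path: str, text_field: str = "text") -> int:
    """Sum token counts over every record in one compressed shard."""
    total = 0
    with gzip.open(jsonl_gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            text = json.loads(line)[text_field]
            # disallowed_special=() so literal special-token strings in old
            # posts don't raise an error
            total += len(enc.encode(text, disallowed_special=()))
    return total
```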

The processing pipeline included full deduplication, binary removal (alt.binaries.* excluded at the hierarchy level before record-level cleaning), quoted-text handling, email address redaction via pattern matching, SHA-256 hashing of Message-IDs, and conversion from raw MBOX archives to gzip-compressed JSONL.
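
As a concrete illustration, a record-level pass of that kind can look roughly like the following Python sketch; the quoted-line heuristic, redaction pattern, and output fields here are illustrative assumptions, not necessarily the exact steps used on this corpus.

```python
# Minimal sketch of one record-level cleaning pass over a local MBOX file.
# The quoted-line heuristic, redaction regex, and output field names are
# illustrative, not the corpus's actual pipeline.
import gzip
import hashlib
import json
import mailbox
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def clean_body(text: str) -> str:
    """Drop quoted lines (leading '>') and redact anything that looks like an email."""
    kept = [line for line in text.splitlines() if not line.lstrip().startswith(">")]
    return EMAIL_RE.sub("[redacted]", "\n".join(kept))

def mbox_to_jsonl(mbox_path: str, out_path: str) -> None:
    """Convert one MBOX archive to gzip-compressed JSONL, deduplicating on hashed Message-ID."""
    seen: set[str] = set()
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for msg in mailbox.mbox(mbox_path):
            hashed_id = hashlib.sha256(msg.get("Message-ID", "").encode()).hexdigest()
            if hashed_id in seen:  # duplicates (e.g. the same crossposted message) share a Message-ID
                continue
            seen.add(hashed_id)
            payload = msg.get_payload(decode=True) or b""
            record = {
                "id": hashed_id,
                "newsgroups": msg.get("Newsgroups", ""),
                "date": msg.get("Date", ""),
                "text": clean_body(payload.decode("utf-8", errors="replace")),
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```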

Language detection was run on every record using Meta's fastText lid.176 model. The corpus is 96.6% English with meaningful representation from 100+ other languages; the soc.culture.* groups in particular have high non-English density.
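
For reference, classifying a single record with that model looks roughly like this, assuming lid.176.bin has already been downloaded locally (the path is illustrative):

```python
# Minimal sketch of per-record language ID with fastText's pretrained
# lid.176.bin model; the model path is an illustrative assumption.
import fasttext

model = fasttext.load_model("lid.176.bin")

def detect_language(text: str) -> tuple[str, float]:
    """Return (language code, confidence) for one post body."""
    # fastText's predict() expects a single line, so flatten newlines first
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0].removeprefix("__label__"), float(probs[0])

print(detect_language("En quelle année Usenet a-t-il été créé ?"))  # e.g. ('fr', 0.99)
```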

The thing I find most interesting about this dataset from a training perspective is the temporal arc. Volume is sparse pre-1986, grows steadily through the early 90s, peaks around 1999–2000, then declines as Usenet gets displaced by forums and social media. That's a 33-year window of language evolution baked into a single coherent corpus — before SEO, before engagement optimization, before AI-generated content existed.

I've published a full data card, cleaning methodology, and representative samples (5K posts per hierarchy + combined sets) on Hugging Face: https://huggingface.co/datasets/OwnedByDanes/Usenet-Corpus-1980-2013
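
If the sample files follow a standard layout, loading them may be as simple as the sketch below; the split name is a guess, so check the data card for the actual structure.

```python
# Sketch of pulling the published samples with the Hugging Face datasets
# library; the split name is a guess based on common repository layouts.
from datasets import load_dataset

samples = load_dataset("OwnedByDanes/Usenet-Corpus-1980-2013", split="train")
print(samples[0])
```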

Happy to answer questions about the processing pipeline or the data itself.

submitted by /u/OwnerByDane

