
Karpathy dropped a 200-line GPT, so I used the math to turn pandas DataFrames into searchable context windows and open sourced it (and automated my stats pipeline). [P]


TL;DR: I got tired of manually running Shapiro-Wilk tests and copy-pasting p-values at 2 AM. I built an open-source, async Python pipeline called StatForge that automates the statistical decision layer, writes APA methods, and lets you chat with your dataset using a microgpt-inspired retrieval system.

Hey everyone,

The hardest part of data analysis isn't the computation (we all have scipy and statsmodels). It's the plumbing—the sequence of choices between loading a CSV and having a defensible result.

I built StatForge to handle the plumbing.

How the pipeline works:

  • Lazy Loading: Detects 15+ formats (CSV, Parquet, SPSS, SQLite) and lazily imports dependencies so you don't pay for bloat.
  • Autonomous Assumption Checks: It doesn't just pass/fail normality. If a Shapiro-Wilk test returns a borderline p = 0.048, it flags it, runs both parametric and non-parametric tests, and compares the robustness of the results.
  • The Plugin Registry: Uses a register decorator pattern for easy custom model injection.
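The borderline-normality behavior described above can be sketched in a few lines of scipy. This is an illustrative reconstruction, not StatForge's actual API: the function name, the `borderline_band` width, and the returned dict shape are all my own assumptions.

```python
# Sketch of the "borderline p-value" logic: if Shapiro-Wilk lands near
# alpha, run BOTH the parametric and non-parametric test and report both.
# (Illustrative only -- names and thresholds are assumptions, not
# StatForge's real interface.)
from scipy import stats

def compare_two_groups(a, b, alpha=0.05, borderline_band=0.01):
    # Normality is only as strong as the worst of the two groups
    p_norm = min(stats.shapiro(a).pvalue, stats.shapiro(b).pvalue)
    t_p = stats.ttest_ind(a, b).pvalue      # parametric
    u_p = stats.mannwhitneyu(a, b).pvalue   # non-parametric
    if abs(p_norm - alpha) < borderline_band:   # e.g. p = 0.048
        # Don't silently pick a side -- surface both results
        return {"verdict": "borderline", "t_test_p": t_p, "mann_whitney_p": u_p}
    chosen = t_p if p_norm > alpha else u_p
    return {"verdict": "clear", "p": chosen}
```

The point of the band is that a hard pass/fail at p = 0.05 would send p = 0.048 and p = 0.052 down completely different branches for what is essentially the same evidence.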
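A register-decorator plugin registry is a common Python pattern; a minimal sketch of how one might look (function and registry names here are illustrative, not StatForge's real API):

```python
# Minimal register-decorator pattern: decorated functions get added to
# a dict keyed by name, so the pipeline can look up custom models later.
REGISTRY = {}

def register(name):
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap

@register("mean_diff")
def mean_diff(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

# The pipeline resolves custom models by name at runtime:
result = REGISTRY["mean_diff"]([1, 2, 3], [0, 1, 2])  # -> 1.0
```

The appeal is that injecting a custom model is one decorator line, with no changes to the pipeline's dispatch code.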

The microgpt Chat Mode: When Karpathy released his 200-line GPT, the way he loaded a corpus (docs: list[str]) changed how I looked at DataFrames. What if each row is a document? StatForge converts datasets into this format, scores rows against plain-English queries, pulls the top-k most relevant rows into a context window, and hits the Anthropic API (or a built-in rule engine). No vector DBs, no FAISS, just clean strings.
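The row-as-document idea above can be sketched with plain strings and a naive bag-of-words overlap score. This is my own minimal reconstruction of the concept; StatForge's actual serialization and scoring may differ.

```python
# Each DataFrame row becomes a "document" string; rows are scored
# against a plain-English query by token overlap and the top-k are
# kept as context. (Naive sketch -- no vector DB, just strings.)
import re
import pandas as pd

def rows_as_docs(df: pd.DataFrame) -> list[str]:
    return [", ".join(f"{col}={val}" for col, val in row.items())
            for _, row in df.iterrows()]

def tokens(s: str) -> set[str]:
    return set(re.findall(r"\w+", s.lower()))

def top_k(docs: list[str], query: str, k: int = 3) -> list[str]:
    q = tokens(query)
    # Higher token overlap with the query -> more relevant row
    return sorted(docs, key=lambda d: -len(q & tokens(d)))[:k]

df = pd.DataFrame({"city": ["Oslo", "Lima"], "temp": [4, 22]})
docs = rows_as_docs(df)
context = top_k(docs, "temp in Oslo", k=1)  # best-matching row string
```

The selected strings can then be dropped straight into an LLM prompt as the context window, which is the whole trick: retrieval without embeddings.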

You can run a full analysis with one command!

I wrote a deep-dive on the architecture and the philosophy behind it here: https://shekhawatsamvardhan.medium.com/andrej-karpathy-dropped-a-200-line-gpt-d153e9557463

Repo is here if you want to break it or contribute: https://github.com/samvardhan03/statforge

Would love to hear how you handle your own stats plumbing, or if there are specific edge cases the decision tree should catch!

submitted by /u/Weary_Possible8913
