Applying Karpathy's autoresearch to a 33M-token public transit dataset (14% improvement, replication notes) [P]
Our take
Hello r/MachineLearning! I'm diving deep into AI and ML within the US transit industry, inspired by Andrej Karpathy's autoresearch framework. I applied this innovative approach to a 33M-token public transit dataset, aiming to assess its effectiveness for specialized data. Despite challenges, I achieved a 14% improvement while validating the methodology. I'm eager to hear your insights on potential pitfalls, intriguing aspects, and recommendations for further exploration. Join me as we explore the intersection of autoresearch and transit data—your input is invaluable!
In a recent exploration of Andrej Karpathy's autoresearch framework, a machine learning enthusiast in the U.S. transit industry has made significant strides by applying this innovative approach to a specialized dataset of 33 million tokens. The project not only achieved a remarkable 14% improvement in model performance but also raised critical questions about the efficacy of autoresearch when applied to smaller, domain-specific datasets. This endeavor highlights the broader implications for machine learning practitioners, especially those working with niche datasets. As noted in similar initiatives, such as the one detailed in "[P] I built an autonomous ML agent that runs experiments on tabular data indefinitely - inspired by Karpathy's AutoResearch," the autonomy and adaptability of machine learning tools are becoming increasingly vital in research and industry applications.
The author's approach to Karpathy's framework illustrates a careful experimentation process, showing how a precise methodology can yield useful insights even under constraints on hardware and data size. This is relevant because many organizations work with legacy systems and small datasets. The key insight is that the autoresearch framework can be adapted to smaller, specialized data, but it needs a safety net against false positives. By keeping the held-out validation scores hidden from the agent, the author caught two experiments that improved the agent's working metric but did not carry over to held-out data, a common form of overfitting in iterative experimentation. Frameworks provide structure, but they must be tailored to the challenges of a specific context.
Another noteworthy finding is that changing how often the model updates, rather than altering its architecture, produced the largest single gain. The agent halved the batch size twice, from 524K to 131K tokens per training step, which fit 3.6× more training updates into the same 5-minute budget. This runs counter to the conventional advice that larger batches train more reliably. The author notes that they would have rejected the change in code review; the agent did not share that bias and found it after eight failed architectural attempts. The result encourages practitioners to test established practices rather than assume them.
As the author contemplates future directions, they pose intriguing questions to the machine learning community, inviting collaboration and input on the next steps. This open dialogue is critical in the ever-evolving landscape of AI and machine learning, where collective insight can drive innovation. Potential avenues for exploration, such as replicating the study at different random seeds or comparing results against general-purpose corpora, could yield further clarity on the strengths of autoresearch in varied contexts. Additionally, the prospect of comparing from-scratch training against domain-adaptive pretraining (DAPT) could illuminate the trade-offs between novel methodologies and established practices.
In conclusion, this project exemplifies the potential of adaptive frameworks in machine learning while highlighting the importance of thorough experimentation and critical evaluation of results. As the field continues to advance, questions about methodology, data specificity, and the role of user-centered design in AI development will remain pertinent. The insights gained from this endeavor not only contribute to the broader discourse on AI applications but also beckon further exploration into how these frameworks can empower users across diverse industries. What other innovative applications might arise as practitioners continue to push the boundaries of existing technologies? The answers may well shape the future of data management and machine learning.
Hello r/MachineLearning! I work in the US transit industry, and I went all-in on learning AI & ML a few months ago. When I heard about Andrej Karpathy's autoresearch framework, I thought it was really cool. I decided to use the same transit dataset from an earlier GPT-2 XL fine-tuning project to train a small 80M-parameter model from scratch. Autoresearch is designed for from-scratch pretraining (not fine-tuning), so I started a new project rather than retrofitting the GPT-2 XL one. I would love to hear from you …
Why did I do this?

My understanding is that Karpathy's autoresearch framework is an LLM-driven research loop: an agent edits a single training script, runs a 5-minute training experiment on a fixed dataset, and commits or reverts based on a single scalar metric. It was designed and tested on FineWeb (effectively an infinite supply of web-scale text). My data, however, is industry-specific and way smaller. After reviewing Karpathy's wiki, I wanted to see whether the framework's core mechanics (the autonomous experiment loop, the 5-minute training limit, and the single-scalar pass/fail ratchet) still produce significant perplexity reductions with limited data.

So I forked autoresearch, pointed it at a small transit-data corpus (~33 million tokens, including traffic analysis, train plans, and regulatory Q&A pairs), and set out to answer two main questions:

Question #1: Does autoresearch work on a corpus six orders of magnitude smaller than its design target?

Question #2: What does the autoresearch agent find that I wouldn't have proposed?

To be clear, the output was intended as a methodology validation, not a deployable chatbot. I wanted to know whether the framework's pattern (autonomous overnight experiments, single-scalar ratchet, git-as-tracker) holds up when the data is small and specialized.

My project constraints
My design choices and why

Early on, I ran into a few challenges. The autoresearch framework makes three assumptions: that FlashAttention-3 kernels are available on the GPU, that the agent's "one change per experiment" rule can be honored with the existing architecture controls, and that the held-out data is big enough to resist adaptive overfitting. None of those held in my setup. Each is addressed below.
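To make the third assumption concrete, here is a minimal sketch of one way a commit gate can resist adaptive overfitting. It is not the actual gate from this project. The `Scores` class, the `gate` function, the both-surfaces rule, the `noise_floor` threshold, and all bpb values are hypothetical; only the metric names `dev_bpb` and `val_public_bpb` come from the post.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    dev_bpb: float         # visible to the agent (its working metric)
    val_public_bpb: float  # hidden from the agent, used only by the gate


def gate(best: Scores, candidate: Scores, noise_floor: float) -> tuple[bool, str]:
    """Return (commit?, message shown to the agent).

    Assumed rule: the candidate must beat the current best on BOTH surfaces
    by more than the noise floor. The message never includes val_public_bpb,
    so the agent cannot tune against the held-out number.
    """
    dev_gain = best.dev_bpb - candidate.dev_bpb  # lower bpb is better
    val_gain = best.val_public_bpb - candidate.val_public_bpb
    commit = dev_gain > noise_floor and val_gain > noise_floor
    verdict = "commit" if commit else "revert"
    return commit, f"{verdict}: dev_bpb {best.dev_bpb:.4f} -> {candidate.dev_bpb:.4f}"


# Illustrative numbers only, not results from the project.
best = Scores(dev_bpb=1.200, val_public_bpb=1.210)
false_positive = Scores(dev_bpb=1.180, val_public_bpb=1.211)  # better on dev only
real_win = Scores(dev_bpb=1.180, val_public_bpb=1.190)        # better on both

print(gate(best, false_positive, noise_floor=0.005))
print(gate(best, real_win, noise_floor=0.005))
```

In this sketch the first candidate is reverted even though its working metric improved, because the agent only ever sees the verdict and its own `dev_bpb`.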
A few more pivots: I split the transit corpus into four parts (train, dev, val_public, test_private), grouping by topic so no document spans the boundary between any two parts. This prevents leakage between training, the agent's working data, the commit-gate data, and the data we hold back for milestone checks. The tokenizer is custom-built so that 65 high-frequency transit acronyms (FTA, MBTA, NTD, IIJA, etc.) each encode as a single token instead of getting split into subword fragments. And before the agent loop ran, I trained the same baseline five times with different random seeds to measure how much each score swings from random luck. That gave me a noise floor for telling real improvements from random variation later on.

Key findings

The biggest single change seemed counterintuitive to me at first. The agent halved the batch size twice, from 524K tokens per training step down to 131K, fitting 3.6× more training updates into the same 5-minute budget (118 → 427 training updates). Each update carried a noisier signal, but there were many more of them, and the Muon optimizer handled the noise without breaking. I would have rejected this in code review on the conventional "bigger batches train more reliably" advice; the agent didn't share that bias and found it on experiment 13, after eight failed architectural attempts.

The model size curve (below) settled the size question. 80M parameters was the clean peak; 30M and 50M lacked capacity, while 100M and 150M couldn't complete enough optimizer steps in 5 minutes to compete (150M only ran for 84 steps before time ran out).

The methodology layer identified two false positives. Two experiments improved the agent's working metric (dev_bpb), but the improvement did not carry over to the held-out surface (val_public_bpb). Without the hidden gate, both would have been committed as wins; instead, both were reverted.

Then my rigor pass humbled me quite a bit.
When I replicated the late-stage "winners" at a different random seed (INIT_SEED=43), the language-modeling result held rock-solid (Δ within ±0.005 across four runs: two architectures × two seeds), but two apparent accuracy improvements collapsed: terminology accuracy swung 9 percentage points between seeds, and regulatory citation accuracy swung 15 percentage points. A proper statistical test on the accuracy benchmarks (terminology, Q&A, regulatory citation) showed that only 1 of 8 head-to-head comparisons was statistically significant. The conclusion was unavoidable: the language-modeling improvement is real (validated separately, ~20x above noise, and replicated at a fresh seed), but the apparent domain-accuracy "wins" turned out to be noise at our benchmark sizes of 100 to 250 items.
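A rough back-of-the-envelope check shows why benchmarks of this size struggle to separate real accuracy gains from noise. This is not the statistical test used in the post. It is a simple binomial sketch that assumes the two runs are scored on independent samples and that accuracy is near 50% (the worst case for variance); the helper name `diff_standard_error` is hypothetical.

```python
import math


def diff_standard_error(p: float, n: int) -> float:
    """Standard error of the difference between two independent accuracy
    estimates, each measured on n items with underlying accuracy p."""
    return math.sqrt(2 * p * (1 - p) / n)


# Benchmark sizes taken from the post (100 to 250 items); p = 0.5 is an assumption.
for n in (100, 250):
    se = diff_standard_error(0.5, n)
    print(f"n={n}: SE of difference = {100 * se:.1f} points, "
          f"95% band = ±{100 * 1.96 * se:.1f} points")
```

Under these assumptions the 95% band is roughly ±14 points at 100 items and roughly ±9 points at 250 items, so head-to-head differences of that size are hard to tell apart from sampling noise. When both runs are scored on the same items, a paired test such as McNemar's is the more appropriate choice.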
Key learnings

Five lessons from this project I plan to carry into any autoresearch-on-small-data follow-up:
Next steps

Honestly, I'm not sure where to go from here. There are a few directions that all feel worth pursuing, and I'd love input from the ML community on which is most interesting. The three I'm weighing:

1. Replicate the project at fresh random seeds. Re-run the full Phase 5 + Phase 7 pipeline at two or three new seeds to see whether the same wins (or close results) emerge … and whether the same false positives recur. I want to know: is the methodology repeatable, or did I get lucky in a different way?

2. Run autoresearch "by the book" on a general-purpose corpus. Clone Karpathy's main repo without my AutoTransit changes and test it on a chunk of FineWeb, which is what the framework is designed for. Comparing those results to the ones on my small, specialized dataset will show which findings are general to autoresearch and which are specific to small data.

3. Compare my from-scratch approach to domain-adaptive pretraining (DAPT). I would take a similarly sized pretrained model off the shelf (Pythia-160M, already trained on web text) and continue training it on my transit dataset, keeping the same data, eval method, and approach. The main question is whether starting from random weights can compete with the obvious shortcut; from what I gather, most research says it shouldn't. If my from-scratch result holds up, that's the interesting part; if not, I'd still learn something useful.
THANK YOU if you’ve read or scrolled this far!! Lol. Please share your thoughts. Where’d I mess up? What’s interesting? What should I consider doing next?