[P] I built an autonomous ML agent that runs experiments on tabular data indefinitely - inspired by Karpathy's AutoResearch
Inspired by Andrej Karpathy's AutoResearch, I built a system where Claude Code acts as an autonomous ML researcher on tabular binary classification tasks (churn, conversion, etc.).
You give it a dataset. It loops forever: analyze data, form hypothesis, edit code, run experiment, evaluate with expanding time windows (train on past, predict future - no leakage), keep or revert via git. It edits only 3 files - feature engineering, model hyperparams, and analysis code. Everything else is locked down.
Edit: To clarify based on some comments, I am using this to solve the problem of finding new signals to add to the model, not trying to overfit a limited dataset. -end Edit-
Key design decisions:
- An analysis loop in addition to the experiment loop - this allows for better reflection and experimentation.
- Optimizing for experiment throughput through several decisions: LightGBM as the default model, limits on feature count and tree count, and locking down each training run until it finishes.
- Constrained editing surface: only 3 files + logs. No infrastructure changes, no package installs. Without this, the agent will eventually try to modify the evaluation code to "improve" its score.
- Docker sandbox - the agent runs with full shell access (--dangerously-skip-permissions). Container keeps it contained.
- Expanding time windows over k-fold - the score is the mean across multiple temporal train/test splits (see the sketch after this list).
- Forced logging - every experiment gets a LOG.md entry (hypothesis, result, takeaway). Significant insights go to LEARNING.md. You can read the agent's reasoning after the fact.
- Analysis primitives built-in - univariate AUC, correlation pairs, null rates, feature importance, error analysis. The agent writes analysis code using these to save time; they also serve as initial suggestions for the first few analyses.
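Here is a minimal sketch of the expanding-window evaluation, assuming a time-sorted DataFrame with `timestamp` and binary target `y` columns; the split count and tree cap are illustrative, not the repo's actual harness:

```python
import lightgbm as lgb
import numpy as np
from sklearn.metrics import roc_auc_score

def expanding_window_auc(df, features, target="y", n_splits=4):
    """Train on an expanding past window, score on the next future slice."""
    df = df.sort_values("timestamp")  # assumes a timestamp column exists
    bounds = np.linspace(0, len(df), n_splits + 2, dtype=int)
    aucs = []
    for i in range(1, n_splits + 1):
        train = df.iloc[:bounds[i]]               # everything up to the cutoff
        test = df.iloc[bounds[i]:bounds[i + 1]]   # the slice right after it
        model = lgb.LGBMClassifier(n_estimators=200)  # tree count capped for throughput
        model.fit(train[features], train[target])
        preds = model.predict_proba(test[features])[:, 1]
        aucs.append(roc_auc_score(test[target], preds))
    return float(np.mean(aucs))  # the single number the agent must improve
```

Because every test slice lies strictly after its training window, a "win" that relies on future information can't survive this evaluation.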
What I learned building this:
- Air-tight evaluation is essential for real improvement - this lesson hit me twice:
- An earlier version didn't constrain which files the agent could edit; it eventually changed the evaluation code to make "improvement" easier for itself.
- K-fold validation was originally employed, and the agent found improvements that were actually data leakage and didn't hold out-of-time. After a painful manual inspection, I switched over to expanding time windows.
- Do everything to protect experiment throughput - this lesson also hit me twice:
- Initially, I let the model run wild and was not very impressed when it barely ran 20 experiments overnight. Turns out the agent had engineered thousands of new features that slowed down training and crashed some runs due to RAM limits. I added the feature count and tree count limits to keep training time reasonable.
- Despite that, the agent still managed to crash or slow down training runs by launching many of them as background processes at the same time. I implemented a locking mechanism to prevent two experiments from running simultaneously (see the sketch after this list). After this, the rate of progress increased to hundreds of runs per day.
- Persistent memory is important: Without forced logging, the agent would repeat experiments it already tried. The LOG.md and LEARNING.md system gives it memory across iterations.
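A minimal sketch of that locking mechanism, assuming a POSIX system - an OS-level file lock so a second concurrent run fails fast instead of silently piling up in the background (illustrative; the lock path and function names are hypothetical, not the repo's actual code):

```python
import fcntl
import sys

LOCK_PATH = "/tmp/experiment.lock"  # hypothetical lock file location

def run_locked(train_fn):
    """Hold an exclusive lock for the duration of one training run."""
    with open(LOCK_PATH, "w") as lock_file:
        try:
            # Non-blocking: raises immediately if another run holds the lock.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            sys.exit("another experiment is already running - try again later")
        train_fn()  # lock is released when the file is closed
```

The OS releases the lock even if the training process crashes, which is what makes this safer than a "lock file exists" check.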
The code is open source (sanitized version): https://github.com/trantrikien239/autoresearch-tabular
Of course it was built with Claude Code, but it has improved so much over rounds of iteration, including manual edits, that I think it's worth sharing.
Related Articles
- [P] Run Karpathy's Autoresearch for $0.44 instead of $24 — Open-source parallel evolution pipeline on SageMaker

TL;DR: I built an open-source pipeline that runs Karpathy's autoresearch on SageMaker Spot instances — 25 autonomous ML experiments for $0.44 total (vs ~$24 on an H100). 4x parallel execution, 2.3x faster, 18x cheaper. Includes an 8-chapter vibe coding tutorial. GitHub

The Problem

Karpathy's autoresearch is brilliant — an AI agent modifies training code, runs 5-minute experiments, keeps improvements, and repeats overnight. But it assumes you have an H100 sitting around for 8 hours. Most of us don't. I wanted to know: can you get the same results on cheap cloud GPUs, paying only pennies per experiment?

What I Built

A parallel evolution pipeline on SageMaker Managed Spot Training:

- Each generation: N candidates generated → N SageMaker Spot jobs run simultaneously → best val_bpb selected → next generation (see the sketch below)
- HUGI pattern (Hurry Up and Get Idle): GPUs spin up for 5 minutes, terminate immediately. Zero idle cost.
- Works with any GPU: H100, L40S, A10G — auto-detects and falls back gracefully

Architecture: [diagram]

Results

| | Original (H100, sequential) | This project (L40S Spot, parallel) |
|---|---|---|
| Cost for 83 experiments | ~$24 (on-demand) / ~$7 (spot) | ~$1.33 |
| Wall clock | ~8 hours | ~3.5 hours |
| GPU idle cost | ~50% wasted | $0 |
| Experiments in parallel | 1 | 4 |

My actual run: 25 experiments across 5 generations for $0.44 on L40S (ml.g7e.2xlarge Spot in us-east-1). The pipeline autonomously discovered that EMBEDDING_LR is the most sensitive parameter, improving val_bpb from 1.0656 → 1.0643 through conservative LR evolution. Architecture changes (deeper models, bigger batches) all failed in the 5-minute budget.

Surprises Along the Way

Some things I learned the hard way:

- Spot capacity varies 1-9 by region. Same instance type: score 1 in us-west-2 (stuck for 30+ min), score 9 in us-east-1 (allocated in 2 min). Always run aws ec2 get-spot-placement-scores before choosing a region.
- Flash Attention 3 doesn't work on L40S. Pre-compiled FA3 kernels only support Hopper (sm_90) and Ampere (sm_80/86). Ada Lovelace (sm_89) crashes at runtime. Had to add a PyTorch SDPA fallback — which halved MFU (20% vs 40%).
- DEVICE_BATCH_SIZE ≠ throughput. Doubled batch size from 64→128, used 2x VRAM... and val_bpb got WORSE. Turns out with fixed TOTAL_BATCH_SIZE, larger micro-batches just reduce gradient accumulation steps without processing more tokens. The real lever is TOTAL_BATCH_SIZE.
- Larger Spot instances can be cheaper. g7e.8xlarge ($0.93/hr) was cheaper than g7e.2xlarge ($1.82/hr) because of lower demand. Check price history for all sizes.
- Cheap GPU experiments transfer to expensive GPUs. Research confirms that architecture/optimizer rankings found on L40S ($0.04/experiment) transfer to H100 for production training. Absolute LR values need re-tuning, but "A beats B" conclusions are portable.

The Vibe Coding Angle

The entire project was built through conversational AI coding (Claude Code) in a single ~13-hour session. I documented the full journey as an 8-chapter vibe coding tutorial — from the initial idea through infrastructure debugging to autonomous evolution results. Every chapter includes the actual prompts used, the failures encountered, and the cost at each step.
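A minimal sketch of the generation loop described above, with `mutate` and `run_spot_job` as hypothetical stand-ins for the repo's candidate generation and SageMaker Spot job submission (a toy objective is used so the sketch runs locally; this is not the actual pipeline):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def mutate(config):
    """Hypothetical: perturb one hyperparameter (here EMBEDDING_LR)."""
    child = dict(config)
    child["EMBEDDING_LR"] = config["EMBEDDING_LR"] * random.choice([0.5, 0.8, 1.25, 2.0])
    return child

def run_spot_job(config):
    """Hypothetical stand-in for one SageMaker Spot training job that
    returns val_bpb; replaced by a toy objective for illustration."""
    return abs(config["EMBEDDING_LR"] - 3e-3)

def evolve(base_config, generations=5, population=4):
    best = base_config
    for gen in range(generations):
        candidates = [mutate(best) for _ in range(population)]
        with ThreadPoolExecutor(max_workers=population) as pool:
            scores = list(pool.map(run_spot_job, candidates))  # N jobs in parallel
        i = min(range(population), key=scores.__getitem__)     # lower val_bpb wins
        best = candidates[i]
        print(f"gen {gen}: best val_bpb proxy = {scores[i]:.4f}")
    return best

print(evolve({"EMBEDDING_LR": 1e-3}))
```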
Try It

```bash
git clone https://github.com/roboco-io/serverless-autoresearch
cd serverless-autoresearch
cp config.yaml.example config.yaml  # Edit with your AWS credentials
make setup    # IAM role
make prepare  # Data → S3
make dry-run  # Verify (free)
make run      # 10 gen × 4 pop = 40 experiments (~$0.70)
```

Links

- GitHub: https://github.com/roboco-io/serverless-autoresearch
- Tutorial: 8-chapter vibe coding tutorial
- Comparison Report: Original vs Serverless
- Spot Capacity Guide: How to find available Spot GPUs
- Key Insights: 12 battle-tested lessons

What's your cheapest setup for running ML experiments? Anyone tried autoresearch on other cloud providers?

Update: I wrote a full step-by-step tutorial documenting how this was built. If you want to learn by doing (not just read the code), I turned the entire build process into an 8-chapter hands-on tutorial:

| Ch | What You'll Learn |
|----|------------------|
| 1 | How a single prompt + deep interview became the architecture |
| 2 | 23 files generated in one session with parallel AI agents |
| 3 | The region saga — Spot scores, quota wars, 3 region migrations |
| 4 | First experiment: FA3 CUDA crash → SDPA fallback → $0.02 success |
| 5 | The Batch Size Trap — why doubling BS made results WORSE |
| 6 | 5 generations of autonomous evolution (what worked vs what failed) |
| 7 | Turning lessons into a reusable Claude Code skill |
| 8 | Final scorecard: 18x cheaper, 2.3x faster |

Every chapter includes the actual prompt I used, what went wrong, and exact commands to reproduce it. Total cost to follow along: ~$0.70. The most educational part is probably Chapter 5 (The Batch Size Trap) — I learned that DEVICE_BATCH_SIZE ≠ throughput the hard way ($0.07 lesson).

Start here: Chapter 1: The Idea
- [P] I trained a Mamba-3 log anomaly detector that hit 0.9975 F1 on HDFS — and I'm curious how far this can go

Experiment #324 ended well. ;) This time I built a small project around log anomaly detection. In about two days, I went from roughly 60% effectiveness in the first runs to a final F1 score of 0.9975 on the HDFS benchmark. Under my current preprocessing and evaluation setup, LogAI reaches F1=0.9975, which is slightly above the 0.996 HDFS result reported for LogRobust in a recent comparative study.

What that means in practice:
- on 3,368 anomalous sessions in the test set, it missed about 9 (recall = 0.9973)
- on roughly 112k normal sessions, it raised only about 3 false alarms (precision = 0.9976)

What I find especially interesting is that this is probably the first log anomaly detection model built on top of Mamba-3 / SSM, which was only published a few weeks ago.

The model is small:
- 4.9M parameters
- trains in about 36 minutes on an RTX 4090
- needs about 1 GB of GPU memory
- inference is below 2 ms on a single consumer GPU, so over 500 log events/sec

For comparison, my previous approach took around 20 hours to train.

The dataset here is the classic HDFS benchmark from LogHub / Zenodo, based on Amazon EC2 logs:
- 11M+ raw log lines
- 575,061 sessions
- 16,838 anomalous sessions (2.9%)

This benchmark has been used in a lot of papers since 2017, so it's a useful place to test ideas.

The part that surprised me most was not just the score, but what actually made the difference. I started with a fairly standard NLP-style approach: a BPE tokenizer and a relatively large model, around 40M parameters. That got me something like 0.61–0.74 F1, depending on the run. It looked reasonable at first, but I kept hitting a wall. Hyperparameter tuning helped a bit, but not enough.

The breakthrough came when I stopped treating logs like natural language. Instead of splitting lines into subword tokens, I switched to template-based tokenization: one log template = one token representing an event type. So instead of feeding the model raw text, I feed it sequences like this:

[5, 3, 7, 5, 5, 3, 12, 12, 5, ...]

Where for example:
- "Receiving block blk_123 from 10.0.0.1" → Template #5
- "PacketResponder 1 terminating" → Template #3
- "Unexpected error deleting block blk_456" → Template #12

That one change did a lot at once (a minimal code sketch of the tokenization follows at the end of this post):
- vocabulary dropped from about 8,000 to around 50
- model size shrank by roughly 10x
- training went from hours to minutes
- and, most importantly, the overfitting problem mostly disappeared

The second important change was matching the classifier head to the architecture. Mamba is causal, so the last token carries a compressed summary of the sequence context. Once I respected that in the pooling/classification setup, the model started behaving the way I had hoped.

The training pipeline was simple:
1. Pretrain (next-token prediction): the model only sees normal logs and learns what "normal" looks like
2. Finetune (classification): the model sees labeled normal/anomalous sessions
3. Test: the model gets unseen sessions and predicts normal vs anomaly

Data split was 70% train / 10% val / 20% test, so the reported F1 is on sessions the model did not see during training.

Another useful thing is that the output is not just binary. The model gives a continuous anomaly score from 0 to 1. So in production this could be used with multiple thresholds, for example:
- > 0.7 = warning
- > 0.95 = critical

Or with an adaptive threshold that tracks the baseline noise level of a specific system.
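A minimal sketch of the template-based tokenization described above, assuming simple regex-based parameter masking (real pipelines typically use a log parser such as Drain; the patterns and template IDs here are illustrative):

```python
import re

templates: dict[str, int] = {}  # masked template string -> token id

def to_template_id(line: str) -> int:
    """Mask variable parts (block ids, IPs, numbers) so only the event type remains."""
    masked = re.sub(r"blk_-?\d+", "<BLK>", line)
    masked = re.sub(r"\d+\.\d+\.\d+\.\d+(:\d+)?", "<IP>", masked)
    masked = re.sub(r"\d+", "<NUM>", masked)
    return templates.setdefault(masked, len(templates))  # new template -> new id

session = [
    "Receiving block blk_123 from 10.0.0.1",
    "PacketResponder 1 terminating",
    "Receiving block blk_456 from 10.0.0.2",
]
print([to_template_id(line) for line in session])  # e.g. [0, 1, 0]: one token per event type
```

Lines 1 and 3 collapse to the same token because they differ only in their parameters, which is exactly why the vocabulary shrinks from thousands of subwords to a few dozen event types.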
A broader lesson for me: skills and workflows I developed while playing with AI models for chess transfer surprisingly well to other domains. That's not exactly new - a lot of AI labs started with games, and many still do - but it's satisfying to see it work in practice.

Also, I definitely did not get here alone. This is a combination of:
- reading a lot of papers
- running automated experiment loops
- challenging AI assistants instead of trusting them blindly
- and then doing my own interpretation and tuning

Very rough split:
- 50% reading papers and extracting ideas
- 30% automated hyperparameter / experiment loops
- 20% manual tuning and changes based on what I learned

Now I'll probably build a dashboard and try this on my own Astrography / Astropolis production logs. Or I may push it further first on BGL, Thunderbird, or Spirit. Honestly, I still find it pretty wild how much can now be done on a gaming PC if you combine decent hardware, public research, and newer architectures quickly enough.

Curious what people here think:
- does this direction look genuinely promising to you?
- has anyone else tried SSMs / Mamba for log modeling?
- and which benchmark would you hit next: BGL, Thunderbird, or Spirit?

If there's interest, I can also share more about the preprocessing, training loop, and the mistakes that got me stuck at 60-70% before it finally clicked.

P.S. I also tested its effectiveness and reproducibility across different seeds. On most of them, it actually performed slightly better than before.
- I Trained an AI to Beat Final Fight… Here's What Happened [P]

Hey everyone, I've been experimenting with Behavior Cloning on a classic arcade game (Final Fight), and I wanted to share the results and get some feedback from the community. The setup is fairly simple: I trained an agent purely from demonstrations (no reward shaping initially), then evaluated how far it could go in the first stage. I also plan to extend this with GAIL + PPO to see how much performance improves beyond imitation.

A couple of interesting challenges came up:
- Action space remapping (MultiBinary → emulator input)
- Trajectory alignment issues (obs/action offset bugs 😅)
- LSTM policy behaving differently under evaluation vs manual rollout
- Managing rollouts efficiently without loading everything into memory

The agent can already make some progress, but still struggles with consistency and survival. I'd love to hear thoughts on:
- Improving BC performance with limited trajectories
- Best practices for transitioning BC → PPO
- Handling partial observability in these environments

Here's the code if you want to see the full process and results: notebooks-rl/final_fight at main · paulo101977/notebooks-rl

Any feedback is very welcome!
- Karpathy dropped a 200-line GPT, so I used the math to turn pandas DataFrames into searchable context windows and open sourced it (and automated my stats pipeline). [P]

TL;DR: I got tired of manually running Shapiro-Wilk tests and copy-pasting p-values at 2 AM. I built an open-source, async Python pipeline called StatForge that automates the statistical decision layer, writes APA methods, and lets you chat with your dataset using a microgpt-inspired retrieval system.

Hey everyone,

The hardest part of data analysis isn't the computation (we all have scipy and statsmodels). It's the plumbing — the sequence of choices between loading a CSV and having a defensible result. I built StatForge to handle the plumbing.

How the pipeline works:
- Lazy Loading: Detects 15+ formats (CSV, Parquet, SPSS, SQLite) and lazily imports dependencies so you don't pay for bloat.
- Autonomous Assumption Checks: It doesn't just pass/fail normality. If a Shapiro-Wilk test returns a borderline p = 0.048, it flags it, runs both parametric and non-parametric tests, and compares the robustness of the results.
- The Plugin Registry: Uses a register decorator pattern for easy custom model injection.
- The microgpt Chat Mode: When Karpathy released his 200-line GPT, the way he loaded a corpus (docs: list[str]) changed how I looked at DataFrames. What if each row is a document? StatForge converts datasets into this format, scores rows against plain-English queries, pulls the top-k most relevant rows into a context window, and hits the Anthropic API (or a built-in rule engine). No vector DBs, no FAISS, just clean strings. (See the sketch below.)

You can run a full analysis with one command!

I wrote a deep-dive on the architecture and the philosophy behind it here: https://shekhawatsamvardhan.medium.com/andrej-karpathy-dropped-a-200-line-gpt-d153e9557463

Repo is here if you want to break it or contribute: https://github.com/samvardhan03/statforge

Would love to hear how you handle your own stats plumbing, or if there are specific edge cases the decision tree should catch!
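A minimal sketch of the rows-as-documents retrieval idea, using simple token-overlap scoring over stringified DataFrame rows (illustrative only; StatForge's actual scoring may differ):

```python
import re
import pandas as pd

def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, so 'group=b' matches the query word 'b'."""
    return set(re.findall(r"[a-z0-9.]+", text.lower()))

def top_k_rows(df: pd.DataFrame, query: str, k: int = 5) -> list[str]:
    """Treat each row as a document, score by token overlap with the query, return top-k."""
    q = tokens(query)
    docs = [" | ".join(f"{col}={val}" for col, val in row.items())
            for _, row in df.iterrows()]
    return sorted(docs, key=lambda d: -len(q & tokens(d)))[:k]

df = pd.DataFrame({"group": ["a", "a", "b"], "score": [1.2, 3.4, 2.1]})
print("\n".join(top_k_rows(df, "score for group b", k=2)))
# The returned strings become the plain-text context window for the LLM call:
# no vector DB, no FAISS, just ranked rows.
```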