[P] Run Karpathy's Autoresearch for $0.44 instead of $24 — Open-source parallel evolution pipeline on SageMaker Spot
TL;DR: I built an open-source pipeline that runs Karpathy's autoresearch on SageMaker Spot instances: 25 autonomous ML experiments for $0.44 total (vs ~$24 on an H100). 4x parallel execution, 2.3x faster, 18x cheaper. Includes an 8-chapter vibe coding tutorial. GitHub: https://github.com/roboco-io/serverless-autoresearch
The Problem
Karpathy's autoresearch is brilliant — an AI agent modifies training code, runs 5-minute experiments, keeps improvements, and repeats overnight. But it assumes you have an H100 sitting around for 8 hours. Most of us don't.
I wanted to know: can you get the same results on cheap cloud GPUs, paying only pennies per experiment?
What I Built
A parallel evolution pipeline on SageMaker Managed Spot Training:
- Each generation: N candidates generated → N SageMaker Spot jobs run simultaneously → best val_bpb selected → next generation
- HUGI pattern (Hurry Up and Get Idle): GPUs spin up for 5 minutes, terminate immediately. Zero idle cost.
- Works with any GPU: H100, L40S, A10G — auto-detects and falls back gracefully
Architecture: diagram (linked in the original post)
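To make the flow concrete, here is a minimal sketch of one generation using the SageMaker Python SDK and boto3. This is my illustration under assumptions, not the repo's actual code: the image URI, role, hyperparameter keys, and metric regex are placeholders; the instance type is the one mentioned in this post.

```python
# Sketch of one generation: N candidates → N concurrent Spot jobs → pick lowest val_bpb.
# Placeholder names throughout; the real pipeline lives in the linked repo.
import boto3
from sagemaker.estimator import Estimator

sm = boto3.client("sagemaker")

def run_generation(candidates, gen):
    job_names = []
    for i, hparams in enumerate(candidates):            # hparams: dict of mutated settings
        est = Estimator(
            image_uri="<training-image-uri>",
            role="<sagemaker-execution-role-arn>",
            instance_count=1,
            instance_type="ml.g7e.2xlarge",             # L40S, as used in the post
            use_spot_instances=True,                    # Managed Spot Training
            max_run=600,                                # ~5-minute experiment plus setup
            max_wait=1800,                              # slack for Spot interruptions
            hyperparameters=hparams,
            metric_definitions=[{"Name": "val_bpb", "Regex": r"val_bpb: ([0-9.]+)"}],
        )
        name = f"autoresearch-gen{gen}-cand{i}"
        est.fit(wait=False, job_name=name)              # non-blocking, so jobs run in parallel
        job_names.append(name)

    # HUGI in practice: each job terminates the instance as soon as training ends,
    # so there is no idle GPU to pay for while we wait here.
    waiter = sm.get_waiter("training_job_completed_or_stopped")
    scores = {}
    for name in job_names:
        waiter.wait(TrainingJobName=name)
        desc = sm.describe_training_job(TrainingJobName=name)
        final = {m["MetricName"]: m["Value"] for m in desc.get("FinalMetricDataList", [])}
        scores[name] = final.get("val_bpb", float("inf"))
    best = min(scores, key=scores.get)                  # survivor seeds the next generation
    return best, scores
```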
Results
| | Original (H100, sequential) | This project (L40S Spot, parallel) |
|---|---|---|
| Cost for 83 experiments | ~$24 (on-demand) / ~$7 (spot) | ~$1.33 |
| Wall clock | ~8 hours | ~3.5 hours |
| GPU idle cost | ~50% wasted | $0 |
| Experiments in parallel | 1 | 4 |
My actual run: 25 experiments across 5 generations for $0.44 on L40S (ml.g7e.2xlarge Spot in us-east-1).
The pipeline autonomously discovered that EMBEDDING_LR is the most sensitive parameter, improving val_bpb from 1.0656 → 1.0643 through conservative LR evolution. Architecture changes (deeper models, bigger batches) all failed in the 5-minute budget.
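For flavor, "conservative LR evolution" here amounts to small multiplicative nudges around the current best candidate. A toy sketch (my illustration; the hyperparameter name matches the post, but the population size and ±15% spread are made up):

```python
# Toy mutation operator: perturb EMBEDDING_LR by a small multiplicative factor.
import random

def mutate_lr(best, population=4, spread=0.15):
    children = []
    for _ in range(population):
        child = dict(best)
        factor = 1.0 + random.uniform(-spread, spread)   # conservative step around the parent
        child["EMBEDDING_LR"] = round(best["EMBEDDING_LR"] * factor, 6)
        children.append(child)
    return children
```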
Surprises Along the Way
Some things I learned the hard way:
Spot capacity varies 1-9 by region. Same instance type: score 1 in us-west-2 (stuck for 30+ minutes), score 9 in us-east-1 (allocated in 2 minutes). Always run `aws ec2 get-spot-placement-scores` before choosing a region.
Flash Attention 3 doesn't work on L40S. Pre-compiled FA3 kernels only support Hopper (sm_90) and Ampere (sm_80/86); Ada Lovelace (sm_89) crashes at runtime. I had to add a PyTorch SDPA fallback, which halved MFU (roughly 20% vs 40%).
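The fallback itself is conceptually simple. A minimal sketch, assuming the usual (batch, heads, seq, head_dim) layout and shown with the flash_attn package's Python API (the repo targets FA3, whose import path differs; this is not its exact guard):

```python
import torch
import torch.nn.functional as F

def make_attention():
    fa_ok = torch.cuda.is_available() and \
        torch.cuda.get_device_capability() in {(9, 0), (8, 0), (8, 6)}  # Hopper sm_90, Ampere sm_80/86
    if fa_ok:
        try:
            from flash_attn import flash_attn_func      # fused kernel, if the wheel is installed
            def attn(q, k, v):
                # flash_attn expects (batch, seq, heads, head_dim), so transpose in and out
                out = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                                      v.transpose(1, 2), causal=True)
                return out.transpose(1, 2)
            return attn
        except ImportError:
            pass
    # Ada Lovelace (sm_89, e.g. L40S) and everything else: PyTorch SDPA.
    # Runs everywhere, but roughly halved MFU in this project (~20% vs ~40%).
    def attn(q, k, v):
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return attn

attention = make_attention()   # q, k, v laid out as (batch, heads, seq, head_dim)
```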
DEVICE_BATCH_SIZE ≠ throughput. Doubled batch size from 64→128, used 2x VRAM... and val_bpb got WORSE. Turns out with fixed TOTAL_BATCH_SIZE, larger micro-batches just reduce gradient accumulation steps without processing more tokens. The real lever is TOTAL_BATCH_SIZE.
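The arithmetic makes it obvious in hindsight. Illustrative numbers only (TOTAL_BATCH_SIZE and SEQ_LEN below are made up for the example, not the repo's settings):

```python
# With TOTAL_BATCH_SIZE fixed, the micro-batch only changes how the same work is split.
TOTAL_BATCH_SIZE = 512          # sequences per optimizer step (held constant)
SEQ_LEN = 1024

for device_bs in (64, 128):
    grad_accum_steps = TOTAL_BATCH_SIZE // device_bs     # 8 → 4
    tokens_per_step = TOTAL_BATCH_SIZE * SEQ_LEN          # 524,288 either way
    print(f"micro-batch {device_bs}: {grad_accum_steps} accumulation steps, "
          f"{tokens_per_step} tokens per optimizer step")
# Doubling the micro-batch halves gradient accumulation, doubles VRAM use,
# and processes exactly the same number of tokens per optimizer step.
```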
Larger Spot instances can be cheaper. g7e.8xlarge ($0.93/hr) was cheaper than g7e.2xlarge ($1.82/hr) because of lower demand. Check price history for all sizes.
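Both of these checks (the region score above and the price history here) can be scripted before you commit to a region. A boto3 sketch; the instance-type names are copied from this post, so swap in whatever family you're actually targeting:

```python
# Pre-flight: score Spot capacity per region, then compare recent prices across sizes.
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placement scores: 1 (unlikely to get capacity) up to 10 (very likely)
scores = ec2.get_spot_placement_scores(
    InstanceTypes=["g7e.2xlarge"],
    TargetCapacity=4,                               # one instance per parallel candidate
    RegionNames=["us-east-1", "us-west-2"],
)
for s in scores["SpotPlacementScores"]:
    print(s["Region"], s["Score"])

# Price history: larger sizes of the same family are sometimes cheaper on Spot.
history = ec2.describe_spot_price_history(
    InstanceTypes=["g7e.2xlarge", "g7e.8xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
)
for h in history["SpotPriceHistory"][:10]:
    print(h["InstanceType"], h["AvailabilityZone"], h["SpotPrice"])
```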
Cheap GPU experiments transfer to expensive GPUs. Research confirms that architecture/optimizer rankings found on L40S ($0.04/experiment) transfer to H100 for production training. Absolute LR values need re-tuning, but "A beats B" conclusions are portable.
The Vibe Coding Angle
The entire project was built through conversational AI coding (Claude Code) in a single ~13-hour session. I documented the full journey as an 8-chapter vibe coding tutorial — from initial idea through infrastructure debugging to autonomous evolution results. Every chapter includes the actual prompts used, the failures encountered, and the cost at each step.
Try It
```bash
git clone https://github.com/roboco-io/serverless-autoresearch
cd serverless-autoresearch
cp config.yaml.example config.yaml
# Edit config.yaml with your AWS credentials

make setup     # IAM role
make prepare   # Data → S3
make dry-run   # Verify (free)
make run       # 10 gen × 4 pop = 40 experiments (~$0.70)
```
Links
- GitHub: https://github.com/roboco-io/serverless-autoresearch
- Tutorial: 8-chapter vibe coding tutorial
- Comparison Report: Original vs Serverless
- Spot Capacity Guide: How to find available Spot GPUs
- Key Insights: 12 battle-tested lessons
What's your cheapest setup for running ML experiments? Anyone tried autoresearch on other cloud providers?
Update: I wrote a full step-by-step tutorial documenting how this was built.
If you want to learn by doing (not just read the code), I turned the entire
build process into an 8-chapter hands-on tutorial:
| Ch | What You'll Learn |
|----|------------------|
| 1 | How a single prompt + deep interview became the architecture |
| 2 | 23 files generated in one session with parallel AI agents |
| 3 | The region saga — Spot scores, quota wars, 3 region migrations |
| 4 | First experiment: FA3 CUDA crash → SDPA fallback → $0.02 success |
| 5 | The Batch Size Trap — why doubling BS made results WORSE |
| 6 | 5 generations of autonomous evolution (what worked vs what failed) |
| 7 | Turning lessons into a reusable Claude Code skill |
| 8 | Final scorecard: 18x cheaper, 2.3x faster |
Every chapter includes the actual prompt I used, what went wrong,
and exact commands to reproduce it. Total cost to follow along: ~$0.70.
The most educational part is probably Chapter 5 (The Batch Size Trap) —
I learned that DEVICE_BATCH_SIZE ≠ throughput the hard way ($0.07 lesson).
Start here: Chapter 1: The Idea