From r/MachineLearning

[P] Run Karpathy's Autoresearch for $0.44 instead of $24 — Open-source parallel evolution pipeline on SageMaker Spot


TL;DR: I built an open-source pipeline that runs Karpathy's autoresearch on SageMaker Spot instances — 25 autonomous ML experiments for $0.44 total (vs ~$24 on an H100). 4x parallel execution, 2.3x faster, 18x cheaper. Includes an 8-chapter vibe coding tutorial. GitHub


The Problem

Karpathy's autoresearch is brilliant — an AI agent modifies training code, runs 5-minute experiments, keeps improvements, and repeats overnight. But it assumes you have an H100 sitting around for 8 hours. Most of us don't.

I wanted to know: can you get the same results on cheap cloud GPUs, paying only pennies per experiment?

What I Built

A parallel evolution pipeline on SageMaker Managed Spot Training:

  • Each generation: N candidates generated → N SageMaker Spot jobs run simultaneously → best val_bpb selected → next generation
  • HUGI pattern (Hurry Up and Get Idle): GPUs spin up for 5 minutes, terminate immediately. Zero idle cost.
  • Works with any GPU: H100, L40S, A10G — auto-detects and falls back gracefully

Architecture: (diagram not reproduced here)

Results

| | Original (H100, sequential) | This project (L40S Spot, parallel) |
|---|---|---|
| Cost for 83 experiments | ~$24 (on-demand) / ~$7 (spot) | ~$1.33 |
| Wall clock | ~8 hours | ~3.5 hours |
| GPU idle cost | ~50% wasted | $0 |
| Experiments in parallel | 1 | 4 |

My actual run: 25 experiments across 5 generations for $0.44 on L40S (ml.g6e.2xlarge Spot in us-east-1).

The pipeline autonomously discovered that EMBEDDING_LR is the most sensitive parameter, improving val_bpb from 1.0656 → 1.0643 through conservative LR evolution. Architecture changes (deeper models, bigger batches) all failed in the 5-minute budget.

Surprises Along the Way

Some things I learned the hard way:

  1. Spot placement scores vary from 1 to 9 by region. Same instance type: score 1 in us-west-2 (stuck waiting 30+ min for capacity), score 9 in us-east-1 (allocated in 2 min). Always run `aws ec2 get-spot-placement-scores` before choosing a region.

  2. Flash Attention 3 doesn't work on L40S. Pre-compiled FA3 kernels only support Hopper (sm_90) and Ampere (sm_80/86). Ada Lovelace (sm_89) crashes at runtime. Had to add a PyTorch SDPA fallback — which halved MFU (20% vs 40%).

  3. DEVICE_BATCH_SIZE ≠ throughput. Doubled batch size from 64→128, used 2x VRAM... and val_bpb got WORSE. Turns out with fixed TOTAL_BATCH_SIZE, larger micro-batches just reduce gradient accumulation steps without processing more tokens. The real lever is TOTAL_BATCH_SIZE.

  4. Larger Spot instances can be cheaper. g6e.8xlarge ($0.93/hr) was cheaper than g6e.2xlarge ($1.82/hr) because of lower demand. Check Spot price history for all sizes.

  5. Cheap GPU experiments transfer to expensive GPUs. Research confirms that architecture/optimizer rankings found on L40S ($0.04/experiment) transfer to H100 for production training. Absolute LR values need re-tuning, but "A beats B" conclusions are portable.
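Lesson 2 above can be encoded as a small capability guard. The compute-capability numbers come from the text (Hopper sm_90 and Ampere sm_80/86 work, Ada sm_89 does not); the function name is mine, and in real code the capability would come from `torch.cuda.get_device_capability()`.

```python
# Pre-compiled Flash Attention 3 kernels cover Hopper (sm_90) and Ampere (sm_80/86).
# Ada Lovelace (sm_89, e.g. L40S) crashes at runtime, so fall back to PyTorch SDPA.
FA3_CAPABILITIES = {(9, 0), (8, 0), (8, 6)}

def use_flash_attention(capability: tuple[int, int]) -> bool:
    """Return True if pre-built FA3 kernels are safe for this compute capability."""
    return capability in FA3_CAPABILITIES

for cap, name in [((9, 0), "H100"), ((8, 6), "A10G"), ((8, 9), "L40S")]:
    backend = "FA3" if use_flash_attention(cap) else "SDPA fallback"
    print(f"{name} (sm_{cap[0]}{cap[1]}): {backend}")
```

The SDPA fallback keeps the run alive but, as noted above, roughly halves MFU on L40S.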

The Vibe Coding Angle

The entire project was built through conversational AI coding (Claude Code) in a single ~13-hour session. I documented the full journey as an 8-chapter vibe coding tutorial — from initial idea through infrastructure debugging to autonomous evolution results. Every chapter includes the actual prompts used, the failures encountered, and the cost at each step.

Try It

```bash
git clone https://github.com/roboco-io/serverless-autoresearch
cd serverless-autoresearch
cp config.yaml.example config.yaml
# Edit config.yaml with your AWS credentials

make setup    # IAM role
make prepare  # Data → S3
make dry-run  # Verify (free)
make run      # 10 gen × 4 pop = 40 experiments (~$0.70)
```


What's your cheapest setup for running ML experiments? Anyone tried autoresearch on other cloud providers?


Update: I wrote a full step-by-step tutorial documenting how this was built.

If you want to learn by doing (not just read the code), I turned the entire
build process into an 8-chapter hands-on tutorial:

| Ch | What You'll Learn |
|----|------------------|
| 1 | How a single prompt + deep interview became the architecture |
| 2 | 23 files generated in one session with parallel AI agents |
| 3 | The region saga — Spot scores, quota wars, 3 region migrations |
| 4 | First experiment: FA3 CUDA crash → SDPA fallback → $0.02 success |
| 5 | The Batch Size Trap — why doubling BS made results WORSE |
| 6 | 5 generations of autonomous evolution (what worked vs what failed) |
| 7 | Turning lessons into a reusable Claude Code skill |
| 8 | Final scorecard: 18x cheaper, 2.3x faster |

Every chapter includes the actual prompt I used, what went wrong,
and exact commands to reproduce it. Total cost to follow along: ~$0.70.

The most educational part is probably Chapter 5 (The Batch Size Trap), where I learned that DEVICE_BATCH_SIZE ≠ throughput the hard way (a $0.07 lesson).
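The trap falls out of the usual gradient-accumulation arithmetic. The TOTAL/SEQ values below are illustrative, not the project's actual config:

```python
def tokens_per_step(total_batch_size: int, seq_len: int) -> int:
    """Tokens consumed per optimizer step depend only on TOTAL_BATCH_SIZE."""
    return total_batch_size * seq_len

def grad_accum_steps(total_batch_size: int, device_batch_size: int, n_gpus: int = 1) -> int:
    """With TOTAL_BATCH_SIZE fixed, a larger micro-batch just means fewer accumulation steps."""
    return total_batch_size // (device_batch_size * n_gpus)

TOTAL, SEQ = 512, 1024
for device_bs in (64, 128):
    print(device_bs, grad_accum_steps(TOTAL, device_bs), tokens_per_step(TOTAL, SEQ))
# Doubling DEVICE_BATCH_SIZE from 64 to 128 halves the accumulation steps (8 -> 4)
# but tokens per optimizer step stay constant: 2x VRAM, no extra throughput.
```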

Start here: Chapter 1: The Idea

submitted by /u/Consistent-Milk-6643

