[P] Vibecoded on a home PC: building a ~2700 Elo browser-playable neural chess engine with a Karpathy-inspired AI-assisted research loop
I built Autochess NN, a browser-playable neural chess engine that started as a personal experiment in understanding AlphaZero-style systems by actually building one end to end.
This project was unapologetically vibecoded - but not in the “thin wrapper around an API” sense. I used AI heavily as a research/coding assistant in a Karpathy-inspired autoresearch workflow: read papers, inspect ideas, prototype, ablate, optimize, repeat. The interesting part for me was seeing how far that loop could go on home hardware (just an ordinary gaming RTX 4090).
Current public V3:
- residual CNN + transformer
- learned thought tokens
- ~16M parameters
- 19-plane 8x8 input
- 4672-move policy head + value head
- trained on 100M+ positions
- pipeline: supervised pretraining on 2200+ rated Lichess games -> Syzygy endgame fine-tuning -> self-play RL with search distillation
- CPU inference + shallow 1-ply lookahead / quiescence (under 2 ms)
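For flavor, here is a minimal sketch of what a 1-ply lookahead over a value head looks like. The `legal_moves`, `apply_move`, and `value_fn` callables are placeholders standing in for the engine's real move generation and network, not the actual project code:

```python
def lookahead_1ply(state, legal_moves, apply_move, value_fn):
    """Pick the move whose child position scores worst for the opponent
    (negamax convention: value_fn evaluates for the side to move)."""
    best_move, best_score = None, float("-inf")
    for move in legal_moves(state):
        child = apply_move(state, move)
        score = -value_fn(child)  # the opponent is to move in `child`
        if score > best_score:
            best_move, best_score = move, score
    return best_move, best_score

# Toy demo: states are ints, moves add an offset, value_fn likes big numbers,
# so the negamax picks the move that keeps the opponent's number smallest.
best, score = lookahead_1ply(
    0,
    legal_moves=lambda s: [1, 2, 3],
    apply_move=lambda s, m: s + m,
    value_fn=float,
)
```

The real engine adds quiescence on top of this, but the shape of the loop is the same.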
I also wrapped it in a browser app so the model is inspectable, not just benchmarked: play vs AI, board editor, PGN import/replay, puzzles, and move analysis showing top-move probabilities and how the “thinking” step shifts them.
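A common way to turn a flat policy head into the top-move probabilities shown in the analysis view is a softmax restricted to legal moves; a minimal sketch (the 4672-way indexing is only assumed here, and the masking trick is standard rather than anything project-specific):

```python
import math

def legal_move_probs(policy_logits, legal_indices):
    """Softmax over the legal subset of a flat policy head,
    so illegal moves get exactly zero probability."""
    logits = [policy_logits[i] for i in legal_indices]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return {i: e / z for i, e in zip(legal_indices, exps)}

# Toy demo: a 6-logit "policy head" where only moves 0, 2, 5 are legal.
# The huge logits at illegal indices 1, 3, 4 are simply never looked at.
probs = legal_move_probs([2.0, 9.9, 1.0, 9.9, 9.9, 0.0], [0, 2, 5])
```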
What surprised me is that, after a lot of optimization, this may have ended up being unusually compute-efficient for its strength - possibly one of the more efficient hobbyist neural chess engines above 2500 Elo. I’m saying that as a hypothesis to pressure-test, not as a marketing claim, and I’d genuinely welcome criticism on evaluation methodology.
I’m now working on V4 with a different architecture:
- CNN + Transformer + Thought Tokens + DAB (Dynamic Attention Bias) @ 50M parameters
For V5, I want to test something more speculative that I’m calling Temporal Look-Ahead: the network internally represents future moves and propagates that information backward through attention to inform the current decision.
Demo: https://games.jesion.pl
Project details: https://games.jesion.pl/about
Price: free browser demo. Nickname/email are only needed if you want to appear on the public leaderboard.
The feedback I’d value most:
- Best ablation setup for thought tokens / DAB
- Better methodology for measuring Elo-vs-compute efficiency on home hardware
- Whether the Temporal Look-Ahead framing sounds genuinely useful or just fancy rebranding of something already known
- Ideas for stronger evaluation against classical engines without overclaiming
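On Elo methodology, the standard logistic mapping from match score to rating difference is at least a shared baseline for discussion (a generic sketch, not the project's evaluation code; with short matches the confidence interval matters far more than the point estimate):

```python
import math

def elo_diff_from_score(wins, losses, draws):
    """Elo difference implied by a match score under the logistic model:
    expected score p maps to 400 * log10(p / (1 - p))."""
    games = wins + losses + draws
    p = (wins + 0.5 * draws) / games  # expected score
    return 400 * math.log10(p / (1 - p))

# A 75% match score (70 wins, 20 losses, 10 draws) implies roughly +191 Elo.
diff = elo_diff_from_score(70, 20, 10)
```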
Cheers, Adam
Related Articles
- [P] I built an autonomous ML agent that runs experiments on tabular data indefinitely - inspired by Karpathy's AutoResearch
  Inspired by Andrej Karpathy's AutoResearch, I built a system where Claude Code acts as an autonomous ML researcher on tabular binary classification tasks (churn, conversion, etc.). You give it a dataset and it loops forever: analyze data, form a hypothesis, edit code, run the experiment, evaluate with expanding time windows (train on the past, predict the future - no leakage), keep or revert via git. It edits only 3 files - feature engineering, model hyperparameters, and analysis code. Everything else is locked down. (Edit: to clarify based on some comments, I am using this to find new signals to add to the model, not to overfit a limited dataset.)
  Key design decisions:
  - An analysis loop in addition to the experiment loop, which allows for better reflection and experimentation.
  - Optimize for experiment throughput: LightGBM as the default model, limits on feature count and tree count, and locking down a training run until it finishes.
  - Constrained editing surface: only 3 files + logs. No infrastructure changes, no package installs. Without this, the agent will eventually try to modify the evaluation code to "improve" its score.
  - Docker sandbox: the agent runs with full shell access (--dangerously-skip-permissions); the container keeps it contained.
  - Expanding time windows over k-fold: mean score across multiple temporal train/test splits.
  - Forced logging: every experiment gets a LOG.md entry (hypothesis, result, takeaway), and significant insights go to LEARNING.md, so you can read the agent's reasoning after the fact.
  - Built-in analysis primitives: univariate AUC, correlation pairs, null rates, feature importance, error analysis. The agent writes analysis code using these to save time, and they also serve as suggestions for the first few analyses.
  What I learned building this:
  - Air-tight evaluation is essential for real improvement - this lesson hit me twice. An earlier version didn't constrain which files the agent could edit, and it eventually changed the evaluation code to make "improvement" easier for itself. K-fold validation was used originally, and the agent found improvements that were actually data leakage and didn't hold out-of-time; after a painful manual inspection, I switched to expanding time windows.
  - Do everything to protect experiment throughput - this lesson also hit me twice. Initially I let the model run wild and was not impressed when it barely ran 20 experiments overnight; it turned out the agent had engineered thousands of new features that slowed down training and crashed some runs due to RAM limits, so I added the feature-count and tree-count limits to keep training time reasonable. Despite that, the agent still managed to crash or slow down training runs by putting many of them into background processes at the same time, so a locking mechanism was implemented to prevent two experiments running concurrently. After this, the rate of progress increased to hundreds of runs per day.
  - Persistent memory is important: without forced logging, the agent would repeat experiments it had already tried. The LOG.md and LEARNING.md system gives it memory across iterations.
  The code is open source (sanitized version): https://github.com/trantrikien239/autoresearch-tabular
  Of course it was done with Claude Code, but it has improved so much over rounds of iterations, including manual edits, that I think it's worth sharing.
  submitted by /u/Pancake502
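The expanding-time-window evaluation that post relies on reduces to a few lines; the split sizes below are my choice for illustration, not the project's:

```python
def expanding_window_splits(n, n_splits=3):
    """Yield (train, test) index lists where each test window strictly
    follows an expanding train window: train on the past, predict the future.
    Unlike shuffled k-fold, no test index ever precedes a train index."""
    fold = n // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train = list(range(0, k * fold))
        test = list(range(k * fold, min((k + 1) * fold, n)))
        yield train, test

# Toy demo on 8 time-ordered samples: train windows grow, test windows advance.
splits = list(expanding_window_splits(8, n_splits=3))
```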
- [P] Run Karpathy's Autoresearch for $0.44 instead of $24 — Open-source parallel evolution pipeline on SageMaker Spot
  TL;DR: I built an open-source pipeline that runs Karpathy's autoresearch on SageMaker Spot instances — 25 autonomous ML experiments for $0.44 total (vs ~$24 on an H100). 4x parallel execution, 2.3x faster, 18x cheaper. Includes an 8-chapter vibe coding tutorial. GitHub
  The Problem
  Karpathy's autoresearch is brilliant — an AI agent modifies training code, runs 5-minute experiments, keeps improvements, and repeats overnight. But it assumes you have an H100 sitting around for 8 hours. Most of us don't. I wanted to know: can you get the same results on cheap cloud GPUs, paying only pennies per experiment?
  What I Built
  A parallel evolution pipeline on SageMaker Managed Spot Training. Each generation: N candidates generated → N SageMaker Spot jobs run simultaneously → best val_bpb selected → next generation.
  - HUGI pattern (Hurry Up and Get Idle): GPUs spin up for 5 minutes, terminate immediately. Zero idle cost.
  - Works with any GPU: H100, L40S, A10G — auto-detects and falls back gracefully.
  (Architecture diagram in the original post.)
  Results
  | | Original (H100, sequential) | This project (L40S Spot, parallel) |
  |---|---|---|
  | Cost for 83 experiments | ~$24 (on-demand) / ~$7 (spot) | ~$1.33 |
  | Wall clock | ~8 hours | ~3.5 hours |
  | GPU idle cost | ~50% wasted | $0 |
  | Experiments in parallel | 1 | 4 |
  My actual run: 25 experiments across 5 generations for $0.44 on L40S (ml.g7e.2xlarge Spot in us-east-1). The pipeline autonomously discovered that EMBEDDING_LR is the most sensitive parameter, improving val_bpb from 1.0656 → 1.0643 through conservative LR evolution. Architecture changes (deeper models, bigger batches) all failed in the 5-minute budget.
  Surprises Along the Way
  Some things I learned the hard way:
  - Spot capacity varies 1-9 by region. Same instance type: score 1 in us-west-2 (stuck for 30+ min), score 9 in us-east-1 (allocated in 2 min). Always run aws ec2 get-spot-placement-scores before choosing a region.
  - Flash Attention 3 doesn't work on L40S. Pre-compiled FA3 kernels only support Hopper (sm_90) and Ampere (sm_80/86); Ada Lovelace (sm_89) crashes at runtime. Had to add a PyTorch SDPA fallback — which halved MFU (20% vs 40%).
  - DEVICE_BATCH_SIZE ≠ throughput. Doubled batch size from 64→128, used 2x VRAM... and val_bpb got WORSE. Turns out with a fixed TOTAL_BATCH_SIZE, larger micro-batches just reduce gradient accumulation steps without processing more tokens. The real lever is TOTAL_BATCH_SIZE.
  - Larger Spot instances can be cheaper. g7e.8xlarge ($0.93/hr) was cheaper than g7e.2xlarge ($1.82/hr) because of lower demand. Check price history for all sizes.
  - Cheap GPU experiments transfer to expensive GPUs. Research confirms that architecture/optimizer rankings found on L40S ($0.04/experiment) transfer to H100 for production training. Absolute LR values need re-tuning, but "A beats B" conclusions are portable.
  The Vibe Coding Angle
  The entire project was built through conversational AI coding (Claude Code) in a single ~13-hour session. I documented the full journey as an 8-chapter vibe coding tutorial — from initial idea through infrastructure debugging to autonomous evolution results. Every chapter includes the actual prompts used, the failures encountered, and the cost at each step.
  Try It
  ```bash
  git clone https://github.com/roboco-io/serverless-autoresearch
  cd serverless-autoresearch
  cp config.yaml.example config.yaml  # Edit with your AWS credentials
  make setup    # IAM role
  make prepare  # Data → S3
  make dry-run  # Verify (free)
  make run      # 10 gen × 4 pop = 40 experiments (~$0.70)
  ```
  Links:
  - GitHub: https://github.com/roboco-io/serverless-autoresearch
  - Tutorial: 8-chapter vibe coding tutorial
  - Comparison Report: Original vs Serverless
  - Spot Capacity Guide: How to find available Spot GPUs
  - Key Insights: 12 battle-tested lessons
  What's your cheapest setup for running ML experiments? Anyone tried autoresearch on other cloud providers?
  Update: I wrote a full step-by-step tutorial documenting how this was built. If you want to learn by doing (not just read the code), I turned the entire build process into an 8-chapter hands-on tutorial:
  | Ch | What You'll Learn |
  |----|------------------|
  | 1 | How a single prompt + deep interview became the architecture |
  | 2 | 23 files generated in one session with parallel AI agents |
  | 3 | The region saga — Spot scores, quota wars, 3 region migrations |
  | 4 | First experiment: FA3 CUDA crash → SDPA fallback → $0.02 success |
  | 5 | The Batch Size Trap — why doubling BS made results WORSE |
  | 6 | 5 generations of autonomous evolution (what worked vs what failed) |
  | 7 | Turning lessons into a reusable Claude Code skill |
  | 8 | Final scorecard: 18x cheaper, 2.3x faster |
  Every chapter includes the actual prompt I used, what went wrong, and exact commands to reproduce it. Total cost to follow along: ~$0.70. The most educational part is probably Chapter 5 (The Batch Size Trap) — I learned that DEVICE_BATCH_SIZE ≠ throughput the hard way ($0.07 lesson). Start here: Chapter 1: The Idea
  submitted by /u/Consistent-Milk-6643
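The generation loop underneath that pipeline (N candidates per generation, keep the best val_bpb) reduces to a small skeleton; `mutate` and `evaluate` here are stand-ins for the real candidate generation and SageMaker Spot jobs:

```python
def evolve(seed_cfg, mutate, evaluate, generations=5, population=4):
    """Parallel-evolution skeleton: spawn `population` mutated candidates per
    generation and keep whichever scores lowest (val_bpb: lower is better).
    In the real pipeline each evaluate() call is a 5-minute Spot training job
    and the candidates run simultaneously; here they run sequentially."""
    best_cfg, best_score = seed_cfg, evaluate(seed_cfg)
    for _ in range(generations):
        candidates = [mutate(best_cfg) for _ in range(population)]
        gen_score, gen_cfg = min((evaluate(c), c) for c in candidates)
        if gen_score < best_score:  # only keep strict improvements
            best_cfg, best_score = gen_cfg, gen_score
    return best_cfg, best_score

# Toy demo: configs are floats, mutation shrinks them, score is the config
# itself, so the best score decays by 0.9 per generation.
cfg, score = evolve(1.0, mutate=lambda c: c * 0.9, evaluate=lambda c: c)
```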
- [P] I trained a Mamba-3 log anomaly detector that hit 0.9975 F1 on HDFS — and I’m curious how far this can go
  Experiment #324 ended well. ;) This time I built a small project around log anomaly detection. In about two days, I went from roughly 60% effectiveness in the first runs to a final F1 score of 0.9975 on the HDFS benchmark. Under my current preprocessing and evaluation setup, LogAI reaches F1 = 0.9975, which is slightly above the 0.996 HDFS result reported for LogRobust in a recent comparative study.
  What that means in practice:
  - on 3,368 anomalous sessions in the test set, it missed about 9 (recall = 0.9973)
  - on roughly 112k normal sessions, it raised only about 3 false alarms (precision = 0.9976)
  What I find especially interesting is that this is probably the first log anomaly detection model built on top of Mamba-3 / SSM, which was only published a few weeks ago. The model is small:
  - 4.9M parameters
  - trains in about 36 minutes on an RTX 4090
  - needs about 1 GB of GPU memory
  - inference is below 2 ms on a single consumer GPU, so over 500 log events/sec
  For comparison, my previous approach took around 20 hours to train.
  The dataset here is the classic HDFS benchmark from LogHub / Zenodo, based on Amazon EC2 logs:
  - 11M+ raw log lines
  - 575,061 sessions
  - 16,838 anomalous sessions (2.9%)
  This benchmark has been used in a lot of papers since 2017, so it’s a useful place to test ideas.
  The part that surprised me most was not just the score, but what actually made the difference. I started with a fairly standard NLP-style approach: a BPE tokenizer and a relatively large model, around 40M parameters. That got me something like 0.61–0.74 F1, depending on the run. It looked reasonable at first, but I kept hitting a wall. Hyperparameter tuning helped a bit, but not enough. The breakthrough came when I stopped treating logs like natural language.
  Instead of splitting lines into subword tokens, I switched to template-based tokenization: one log template = one token representing an event type. So instead of feeding the model raw text, I feed it sequences like this:
  [5, 3, 7, 5, 5, 3, 12, 12, 5, ...]
  Where, for example:
  - "Receiving block blk_123 from 10.0.0.1" → Template #5
  - "PacketResponder 1 terminating" → Template #3
  - "Unexpected error deleting block blk_456" → Template #12
  That one change did a lot at once:
  - vocabulary dropped from about 8000 to around 50
  - model size shrank by roughly 10x
  - training went from hours to minutes
  - and, most importantly, the overfitting problem mostly disappeared
  The second important change was matching the classifier head to the architecture. Mamba is causal, so the last token carries a compressed summary of the sequence context. Once I respected that in the pooling/classification setup, the model started behaving the way I had hoped.
  The training pipeline was simple:
  1. Pretrain (next-token prediction): the model only sees normal logs and learns what “normal” looks like
  2. Finetune (classification): the model sees labeled normal/anomalous sessions
  3. Test: the model gets unseen sessions and predicts normal vs anomaly
  The data split was 70% train / 10% val / 20% test, so the reported F1 is on sessions the model did not see during training.
  Another useful thing is that the output is not just binary. The model gives a continuous anomaly score from 0 to 1, so in production this could be used with multiple thresholds, for example:
  - > 0.7 = warning
  - > 0.95 = critical
  Or with an adaptive threshold that tracks the baseline noise level of a specific system.
  A broader lesson for me: skills and workflows I developed while playing with AI models for chess transfer surprisingly well to other domains. That’s not exactly new - a lot of AI labs started with games, and many still do - but it’s satisfying to see it work in practice. Also, I definitely did not get here alone.
  This is a combination of:
  - reading a lot of papers
  - running automated experiment loops
  - challenging AI assistants instead of trusting them blindly
  - and then doing my own interpretation and tuning
  Very rough split: 50% reading papers and extracting ideas, 30% automated hyperparameter / experiment loops, 20% manual tuning and changes based on what I learned.
  Now I’ll probably build a dashboard and try this on my own Astrography / Astropolis production logs. Or I may push it further first on BGL, Thunderbird, or Spirit.
  Honestly, I still find it pretty wild how much can now be done on a gaming PC if you combine decent hardware, public research, and newer architectures quickly enough.
  Curious what people here think:
  - does this direction look genuinely promising to you?
  - has anyone else tried SSMs / Mamba for log modeling?
  - which benchmark would you hit next: BGL, Thunderbird, or Spirit?
  If there’s interest, I can also share more about the preprocessing, training loop, and the mistakes that got me stuck at 60-70% before it finally clicked.
  P.S. I also tested its effectiveness and reproducibility across different seeds. On most of them, it actually performed slightly better than before.
  submitted by /u/Adam_Jesion
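The template-based tokenization step from that last post is essentially log parsing; a crude illustrative version with hand-written masks (real pipelines typically use a parser such as Drain, and these regexes and template IDs are mine, not the author's):

```python
import re

def template_id(line, templates):
    """Map a raw log line to a small integer template token by masking
    its variable fields (block ids, IPs, bare numbers), so all lines of
    the same event type collapse to one token."""
    masked = re.sub(r"blk_-?\d+", "<BLK>", line)
    masked = re.sub(r"\b\d+\.\d+\.\d+\.\d+\b", "<IP>", masked)
    masked = re.sub(r"\b\d+\b", "<NUM>", masked)
    if masked not in templates:
        templates[masked] = len(templates)  # ids in order of first appearance
    return templates[masked]

# Toy demo: lines 1 and 3 differ only in variable fields, so they share a token.
templates = {}
seq = [template_id(line, templates) for line in [
    "Receiving block blk_123 from 10.0.0.1",
    "PacketResponder 1 terminating",
    "Receiving block blk_456 from 10.0.0.2",
]]
```

The resulting integer sequences are what the sequence model actually consumes, which is where the vocabulary collapse from ~8000 subwords to ~50 templates comes from.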