I ran 1 trillion Kentucky Derby simulations on a 1,000-vCPU cluster. Here’s what the model likes

Our take

I conducted an extensive analysis of the Kentucky Derby, running one trillion simulations on a 1,000-vCPU cloud cluster. This model, built using a combination of historical data and machine learning techniques, offers insights into race dynamics and potential outcomes. With the top pick, *Further Ado*, showing a significant win probability of 27.9%, this approach highlights the power of data-driven decision-making in horse racing. While not a guarantee of success, these findings provide a fresh perspective for enthusiasts and bettors alike.

One trillion simulations in under fifty minutes. That is not a flex—it is a window into what happens when spreadsheet-scale thinking meets cloud-native compute. The author of this model ran a Dirichlet weight search across sixteen historical Derbies, layered in a sklearn ensemble, and fired off a trillion Monte Carlo race simulations from a 1,000-vCPU cluster, all while posting a backtest that landed 126 out of 160 on a ranking metric and cleared a 2,000-permutation null test with p < 1/2000. The signal is real. And the fact that the entire pipeline lives in an open-source Python library called Burla, documented with full methodology and audit trail, says something important about where data modeling is heading. You do not need a hedge fund to explore what a trillion simulations can reveal. You need the right tools and the willingness to ask better questions of your data. For anyone watching how AI-native workflows are reshaping financial modeling, this is worth a closer look—especially alongside projects like Build AI Financial Models in Sourcetable that make similar ambitions accessible to smaller teams.

What makes this work compelling is not the scale. Scale alone is a commodity now. The real insight is in the methodological discipline. The author treats the model as a math toy, not a prediction machine, and that distinction matters enormously. The backtest numbers are strong, but the honest caveats are stronger: two of the top-five model weights are placeholders, the model cannot see workouts or weather, and a trillion simulations does not override the fundamental uncertainty of a horse race. That kind of restraint is rare in a world saturated with overconfident dashboards and inflated claims. The model identifies apparent edges over the morning line (Further Ado at 1.95x, Litmus Test at 1.91x, Intrepido at 1.88x, Robusta at 1.86x, and Pavlovian at 1.74x), but it also flags Renegade as a clear fade: the model gives it 4.2% against a 20.0% morning-line implied probability, and post position one has not produced a Derby winner in the 2010–2025 sample. The takeaway is not "bet these horses." The takeaway is that structured, reproducible analysis can surface edges that gut feeling and handicapping folklore miss. And that principle applies whether you are modeling a horse race or building a quarterly revenue forecast.

There is a broader lesson hiding in the infrastructure choices here. The author used a 1,000-vCPU cloud cluster and finished in 48.9 minutes, then candidly noted the electric bill did not enjoy the experience. That tension between capability and cost is something anyone managing data operations should recognize. Running massive simulations is easier than ever, but the value still comes from knowing what to simulate, why, and how to interpret the output. A trillion runs mean nothing if your feature set is noisy or your null test is weak. The pipeline design—the Dirichlet weight search, the ensemble learning layer, the Monte Carlo framework—reveals someone who thinks about modeling as a system, not a spreadsheet trick. It is the kind of thinking that separates exploration from brute force.

As we move deeper into an era where cloud-native computation is a baseline expectation rather than a luxury, the question worth watching is not whether we can simulate more. It is whether we can ask sharper questions of the data we already have. The model here succeeded not because it threw a trillion darts at a wall, but because the underlying framework was built to distinguish signal from noise. That is the bar. And it is one worth meeting in every domain where people still default to manual aggregation and static reports when something more dynamic is within reach.

Built a Kentucky Derby model on a 1,000-vCPU cloud cluster.

https://burla-cloud.github.io/examples/kentucky-derby-demo/

Pipeline: Dirichlet weight search across 16 historical Derbies (2010 to 2025) + sklearn ensemble for ML probs + 1,000,000,000,000 Monte Carlo race sims. 48.9 minutes wall time. Yes, one trillion sims. No, my electric bill did not enjoy this.
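
For readers who want the Monte Carlo step made concrete, here is a minimal sketch of turning per-horse ratings into win percentages by simulating many races. It is an illustration under assumptions, not the linked pipeline's code: the post does not say how a single race is simulated, so this version assumes each horse's performance is its rating plus Gaussian noise and the best performance wins. The horse names, ratings, noise level, and the function name `simulate_win_rates` are all hypothetical.

```python
import random

def simulate_win_rates(ratings, noise_sd, n_sims, seed=0):
    """Estimate win probabilities by simulating n_sims independent races.

    ratings: dict of horse name -> rating (hypothetical numbers).
    noise_sd: standard deviation of race-day noise (an assumption).
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in ratings}
    for _ in range(n_sims):
        # One simulated race: performance = rating + random noise.
        performances = {
            name: rng.gauss(rating, noise_sd) for name, rating in ratings.items()
        }
        winner = max(performances, key=performances.get)
        wins[winner] += 1
    # Win rate is the share of simulated races each horse won.
    return {name: count / n_sims for name, count in wins.items()}

# Hypothetical three-horse field, made up for illustration only.
field = {"Horse A": 100.0, "Horse B": 96.0, "Horse C": 92.0}
rates = simulate_win_rates(field, noise_sd=6.0, n_sims=50_000)
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

Each simulated race is independent of the others, which is why this kind of loop splits cleanly across many CPUs.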

Backtest landed 126/160 on a 10-5-2-1-0 ranking metric. 2,000-permutation null test (re-run after scrambling winner labels) puts p < 1/2000. Real signal, not search noise.
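
A rough sketch of how the weight search, the ranking metric, and the permutation null test might fit together, on synthetic data only. Several things here are assumptions and not taken from the post: the 10-5-2-1-0 metric is read as 10 points when the actual winner is the model's first choice, 5 for second, 2 for third, 1 for fourth, and 0 otherwise (which matches a 160-point maximum over 16 races); weights are drawn from a flat Dirichlet; and "scrambling winner labels" is read as reassigning the winner label at random within each race. All function names and the synthetic races are hypothetical.

```python
import random

POINTS = [10, 5, 2, 1]  # assumed reading of the 10-5-2-1-0 metric

def dirichlet(rng, k):
    """Draw a flat Dirichlet(1, ..., 1) weight vector via normalized gammas."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def score(weights, races):
    """Sum points for where each race's actual winner lands in the model ranking."""
    total = 0
    for features, winner in races:
        composite = [sum(w * f for w, f in zip(weights, row)) for row in features]
        order = sorted(range(len(features)), key=lambda i: -composite[i])
        rank = order.index(winner)
        total += POINTS[rank] if rank < len(POINTS) else 0
    return total

def best_score(races, n_candidates, rng):
    """Random search: best backtest score over sampled weight vectors."""
    k = len(races[0][0][0])
    return max(score(dirichlet(rng, k), races) for _ in range(n_candidates))

def permutation_p_value(races, n_candidates, n_perms, rng):
    """Re-run the whole search on scrambled winner labels, compare to the real score."""
    observed = best_score(races, n_candidates, rng)
    hits = 0
    for _ in range(n_perms):
        scrambled = [(feats, rng.randrange(len(feats))) for feats, _ in races]
        if best_score(scrambled, n_candidates, rng) >= observed:
            hits += 1
    # Add-one estimate, so the reported p-value is never exactly zero.
    return observed, (hits + 1) / (n_perms + 1)

def make_synthetic_races(rng, n_races=16, n_horses=12, n_features=4):
    """Synthetic backtest set with a planted signal in the first feature."""
    races = []
    for _ in range(n_races):
        feats = [[rng.random() for _ in range(n_features)] for _ in range(n_horses)]
        winner = max(range(n_horses), key=lambda i: feats[i][0])
        races.append((feats, winner))
    return races

rng = random.Random(42)
races = make_synthetic_races(rng)
observed, p = permutation_p_value(races, n_candidates=100, n_perms=100, rng=rng)
print(f"observed score {observed}/{10 * len(races)}, permutation p ~ {p:.4f}")
```

The point of re-running the search inside every permutation is that the null distribution then includes the search's own ability to overfit. With 2,000 permutations and no scrambled run matching the real score, an add-one estimate like this one would come out at 1/2001, in line with the p < 1/2000 quoted above.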

This is not financial advice. The model is a math toy, not a guarantee, and a trillion sims doesn't change the fact that a horse race is still a horse race.

Four scratches (Silent Tactic, Fulleffort, Right To Party, The Puma) cut the field to 19. All comparisons below are model win % vs morning-line implied %. Program posts (1, 2, 3, 4, 6, 7, 8, 10, 11, 12, 14, 15, 16, 17, 18, 19, 21, 22, 23) leave gaps where horses scratched and put the three also-eligibles (Great White, Ocelli, Robusta) on the deep outside.

Top win pick (BET)

Further Ado (post 18, 6-1). 27.9% vs 14.3% = 1.95x. Field-leading 106 Beyer. Cox / Velazquez. Drew the highest-historical-win-rate gate in the 2010-2025 sample (Authentic won from post 18 in 2020). The model's chalk is also the value play.

Four longshots tagged BET (model at least 1.5x morning-line implied)

Litmus Test (post 4, 30-1). 6.12% vs 3.20% = 1.91x. Baffert / Garcia. Beyer 96.
Intrepido (post 3, 50-1). 3.75% vs 2.00% = 1.88x. Mullins / Berrios. Beyer 89, pace run style.
Robusta (post 23, 50-1). 3.73% vs 2.00% = 1.86x. O'Neill, who also trains Pavlovian. Calumet homebred. Drew in from the AE list when Right To Party scratched.
Pavlovian (post 16, 30-1). 5.58% vs 3.20% = 1.74x. O'Neill (2-for-Derby) / Maldonado. Beyer 90 sits one above field median. Post 16 is where Sovereignty won in 2025.
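
The arithmetic behind the "model % vs morning-line implied %" comparisons can be sketched as follows. The standard conversion for odds of N-1 is 1 / (N + 1), which reproduces the 14.3% quoted for 6-1 and the 20.0% quoted for 4-1. The post lists 3.20% for 30-1 and 2.00% for 50-1, where 1 / (N + 1) would give about 3.23% and 1.96%, so the author may be rounding or using a slightly different convention. For that reason the multipliers below are recomputed from the percentages as quoted, for a few of the horses. The function names are hypothetical, and the 1.5x cutoff is the BET threshold stated above.

```python
def implied_probability(odds_to_one):
    """Convert odds of N-1 into an implied win probability, 1 / (N + 1)."""
    return 1.0 / (odds_to_one + 1.0)

def value_multiplier(model_pct, implied_pct):
    """Ratio of model win % to market-implied win %."""
    return model_pct / implied_pct

# Odds-to-probability conversion for the two shortest prices in the post.
print(f"6-1 implies {implied_probability(6):.1%}")  # Further Ado's price
print(f"4-1 implies {implied_probability(4):.1%}")  # Renegade's price

# (model %, morning-line implied %) pairs exactly as quoted in the post.
quoted = {
    "Further Ado": (27.9, 14.3),
    "Litmus Test": (6.12, 3.20),
    "Intrepido": (3.75, 2.00),
    "Pavlovian": (5.58, 3.20),
}
for horse, (model_pct, implied_pct) in quoted.items():
    multiplier = value_multiplier(model_pct, implied_pct)
    tag = "BET" if multiplier >= 1.5 else "pass"
    print(f"{horse}: {multiplier:.2f}x -> {tag}")
```

Because the quoted percentages are already rounded, a recomputed ratio can differ from the post's figure in the last digit.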

Top 5 by model win %

Further Ado, 27.90%
Chief Wallabee, 6.75%
Litmus Test, 6.12%
So Happy, 5.73%
Pavlovian, 5.58%

Headline fade

Renegade (post 1, 4-1). 4.2% vs 20.0% = 4.7x market over model, the biggest gap on the board. Post 1 has not produced a Derby winner in our 2010-2025 sample (none since Ferdinand 1986). Toss off the top of every ticket.

Honest caveats

Morning line, not closing tote. Renegade likely tightens, longshots drift.
Churchill takes ~17-22%. The five BETs (multipliers 1.74x to 1.95x) clear takeout. Further Ado is the only one stake-able at full bankroll; the four longshots stay as small saver tickets.
Two of the top-five model weights (dosage, career win-rate) are placeholders for 2026 (same value for every horse). The 2026 ranking effectively leans on year-Beyer, stamina-test, post-position win-rate, trainer/jockey edges, and run style.
Model can't see Ragozin / Thoro-Graph / today's workouts / closing tote / weather. Or how good your bourbon is.

Tickets (light stakes, $30.40 total)

$10 win on Further Ado at 6-1 (full-stake)
$3 win each on Litmus Test, Pavlovian, Intrepido, Robusta ($12)
$1 exacta box: Further Ado / Chief Wallabee / Litmus Test ($6)
10-cent superfecta box: Further Ado / Litmus Test / Pavlovian / Robusta ($2.40)
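
A quick check of the ticket costs, using the standard rule that a box bet costs one base stake per ordered arrangement of the boxed horses (3 x 2 = 6 exactas for three horses, 4 x 3 x 2 x 1 = 24 superfectas for four). Amounts are kept in cents to avoid floating-point drift. The helper name `box_cost_cents` is hypothetical.

```python
from math import perm

def box_cost_cents(base_stake_cents, n_horses, positions):
    """Cost of a box bet: one base stake per ordered arrangement of the horses."""
    return base_stake_cents * perm(n_horses, positions)

tickets_cents = {
    "$10 win on Further Ado": 1000,
    "$3 win on each of 4 longshots": 300 * 4,
    "$1 exacta box, 3 horses": box_cost_cents(100, 3, 2),
    "10-cent superfecta box, 4 horses": box_cost_cents(10, 4, 4),
}
total_cents = sum(tickets_cents.values())

for label, cents in tickets_cents.items():
    print(f"{label}: ${cents / 100:.2f}")
print(f"total: ${total_cents / 100:.2f}")
```

The four line items come to $30.40 in total.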

Disclosure: I built the model and I work on Burla, the open-source Python library that ran the cluster.

Full pipeline, methodology audit, and all 19 horses ranked: burla-cloud.github.io/examples/kentucky-derby-demo/#rankings

GL today, may your closer hit the wire first.

submitted by /u/Ok_Post_149