
LLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]


I built a small website called LLM Win:

https://llm-win.com

It turns LLM benchmark results into a directed graph:

If model A beats model B on benchmark X, add an edge A -> B.

Then it searches for the shortest transitive chain between two models.
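The construction and search described above can be sketched in a few lines. This is a toy illustration, not the site's actual pipeline: the model names and scores below are made up, and the real data comes from Artificial Analysis.

```python
from collections import deque

# Hypothetical benchmark results: {benchmark: {model: score}}.
results = {
    "AIME 2025": {"A": 0.61, "B": 0.55, "C": 0.40},
    "IFBench":   {"B": 0.72, "C": 0.69, "A": 0.50},
    "SciCode":   {"C": 0.44, "A": 0.38},
}

def build_win_graph(results):
    """Add an edge A -> B whenever A outscores B on some benchmark."""
    graph = {}
    for scores in results.values():
        models = list(scores)
        for a in models:
            for b in models:
                if scores[a] > scores[b]:
                    graph.setdefault(a, set()).add(b)
    return graph

def shortest_win_chain(graph, src, dst):
    """BFS for the shortest transitive chain of benchmark wins."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no win chain exists

graph = build_win_graph(results)
# C never beats B directly, but C beats A (SciCode) and A beats B
# (AIME 2025), so a 2-hop chain exists: C -> A -> B.
```

BFS gives the shortest chain by hop count, which is what makes the 2-3 hop statistics below meaningful.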

The meme version is:

Can LLaMA 2 7B beat Claude Opus 4.7?

In an absurd transitive benchmark sense, sometimes yes. But I added a Report tab because the structure itself seems useful for model evaluation. Some experimental findings from the current Artificial Analysis data snapshot:

  1. Weak-to-strong reachability is high. I checked 126,937 pairs where the source model has a lower Intelligence Index than the target model. 119,514 of them are reachable through benchmark win chains, a reachability rate of 94.2%.

  2. Most paths are short. Among reachable weak-to-strong pairs, 2-3 hop paths account for 91.4%, so this is not mostly long-chain cherry-picking.

  3. Direct reversal triples are abundant. After treating non-positive benchmark values as missing, there are still about 119k direct weak-over-strong triples of the form: (source model, target model, benchmark), where the source has lower Intelligence Index but higher score on that benchmark.

  4. Some benchmarks create more reversals than others. Current high-reversal / useful-signal candidates include: Humanity's Last Exam, IFBench, AIME 2025, TAU2, SciCode

  5. Different benchmarks carry different signals. For example, IFBench shows roughly a ~17.5% reversal rate, ~80.0% coverage, and r ≈ 0.82 correlation with Intelligence Index. A benchmark that correlates this strongly with the overall ranking yet still reverses a nontrivial share of pairs may provide an independent skill signal rather than simply duplicating the ranking.
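The reversal triples from point 3 and the per-benchmark reversal rate from point 5 can be sketched as follows. The Intelligence Index values and scores here are invented for illustration; only the filtering rule (non-positive scores treated as missing) comes from the post.

```python
# Hypothetical overall Intelligence Index per model, and per-benchmark
# scores where non-positive values are treated as missing.
intelligence = {"A": 60.0, "B": 45.0, "C": 30.0}
scores = {
    "IFBench":   {"A": 0.50, "B": 0.72, "C": 0.69},
    "AIME 2025": {"A": 0.61, "B": 0.55, "C": -1.0},  # C missing here
}

def reversal_triples(intelligence, scores):
    """(weaker, stronger, benchmark) triples where the model with the
    lower Intelligence Index nonetheless has the higher score."""
    triples = []
    for bench, s in scores.items():
        valid = {m: v for m, v in s.items() if v > 0}  # drop missing
        for weak in valid:
            for strong in valid:
                if (intelligence[weak] < intelligence[strong]
                        and valid[weak] > valid[strong]):
                    triples.append((weak, strong, bench))
    return triples

def reversal_rate(intelligence, scores, bench):
    """Share of comparable pairs on one benchmark that are reversed."""
    valid = {m: v for m, v in scores[bench].items() if v > 0}
    pairs = [(w, s) for w in valid for s in valid
             if intelligence[w] < intelligence[s]]
    rev = [p for p in pairs if valid[p[0]] > valid[p[1]]]
    return len(rev) / len(pairs) if pairs else 0.0
```

On this toy data, both B-over-A and C-over-A are reversals on IFBench, so its reversal rate is 2 out of 3 comparable pairs.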

My current interpretation:

LLM rankings are better represented as a benchmark-specific capability graph than as a single ladder. Some reversals probably reflect real specialization; some reflect benchmark coverage limits, volatility, or measurement noise.

The next question is whether reversal structure can help build better evaluation metrics:

  • identify specialist models;
  • identify volatile benchmarks;
  • build robust generalist scores;
  • select complementary benchmark sets;
  • decompose models into capability fingerprints.
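As one concrete reading of the last bullet, a capability fingerprint could be as simple as each model's scores standardized within each benchmark, so positive entries mark benchmarks where a model punches above the field. This is a sketch on made-up data, not the site's method:

```python
import statistics

# Hypothetical per-benchmark scores: {benchmark: {model: score}}.
scores = {
    "IFBench": {"A": 0.50, "B": 0.72, "C": 0.69},
    "SciCode": {"A": 0.38, "B": 0.41, "C": 0.44},
}

def fingerprints(scores):
    """Per-model vector of benchmark scores, z-scored per benchmark."""
    fp = {}
    for bench, s in scores.items():
        mean = statistics.mean(s.values())
        sd = statistics.pstdev(s.values()) or 1.0  # avoid divide-by-zero
        for model, v in s.items():
            fp.setdefault(model, {})[bench] = (v - mean) / sd
    return fp
```

Here model C would come out above average on both benchmarks despite having the lowest overall score on neither fingerprint axis inflated by a single leaderboard number.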

Curious what people think: Is benchmark reversal structure a useful evaluation signal, or mostly an artifact of noisy benchmarks?

submitted by /u/Spico197

