May 26, 2026•3 min read•from Machine Learning

The famous METR AI time horizons graph contains numerous severe errors [D]

Our take

In a critical analysis published in Transformer, Nathan Witkin identifies significant flaws in the METR AI time horizons graph and warns against relying on its conclusions. He argues that its numerous errors, including biased data collection and questionable measurement practices, prevent any meaningful conclusions from being drawn. Witkin emphasizes the importance of rigorous scientific standards and peer review to prevent the proliferation of misleading information. For those interested in further exploring data integrity, check out "I Built My First ETL Pipeline as a Complete Beginner. Here’s How."

In a recent critique, Nathan Witkin shines a light on the shortcomings of the METR AI time horizons graph, revealing a troubling reality within the realm of AI research. His analysis underscores a significant issue: the reliance on flawed data can lead to misleading conclusions that ripple through the field, potentially skewing our understanding of AI capabilities. As demonstrated in this critique, the METR graph is not merely an isolated example of error; it reflects a broader pattern among researchers who may prioritize complexity and presentation over rigor and accuracy. This is a crucial conversation, especially as we navigate the evolving landscape of AI and data management, where clarity and precision are paramount.

Witkin points out that the METR graph’s numerous compounding errors prevent any meaningful conclusions from being drawn. These include human baselines that were estimated rather than measured, hourly pay that rewarded benchmarkers for taking longer, and a sample of benchmarkers drawn from the authors’ own circles. Such flaws are not just academic oversights; they can misinform organizations seeking to use AI-driven solutions to improve productivity. For instance, users evaluating tools for automating their workflows, as in articles like "Automating Revenue Forecast Sheet based on Period of Performance and Deal Close Date" and "I Built My First ETL Pipeline as a Complete Beginner. Here’s How.", need reliable benchmarks to make informed decisions. A graph that misrepresents capabilities could lead them to adopt inefficient or ineffective solutions, ultimately hindering their productivity.

The implications of this critique extend beyond the METR graph itself; they call for a critical examination of the standards applied in AI research. As Witkin notes, the issues identified in the METR benchmark are symptomatic of a wider pathology in AI research: a tendency to over-rely on anecdotal data from power users and on benchmarks that are themselves compromised. This underscores the need for stricter adherence to scientific standards and best practices, particularly peer review processes that can help filter out flawed analyses before they gain traction within the community. Failing to uphold rigorous standards creates an environment where misinformation proliferates, undermining trust in AI technologies and their applications.

The importance of accuracy in AI research raises a broader question about the future of data management and technology. The METR critique serves as a reminder of the responsibility researchers and practitioners have to prioritize integrity and clarity over complexity and allure. In a rapidly advancing field, it’s essential that stakeholders, whether researchers, developers, or end-users, commit to a culture of transparency and accountability. By fostering an environment that values robust methodologies, we can ensure that the tools we develop and the data we rely upon genuinely reflect the potential of AI technologies.

Moving forward, the AI community must remain vigilant in distinguishing between genuine innovation and superficial claims. As we continue to explore transformative solutions in data management, we must also strive for a collective commitment to high-quality information that empowers users rather than misguides them. The question remains: how will we elevate the standards of research and practice in AI to avoid the pitfalls exemplified by the METR graph? As we seek to transform our workflows and enhance productivity, this commitment to integrity will be essential in shaping a future that truly harnesses the power of AI.

Nathan Witkin, a research writer at NYU Stern’s Tech and Society Lab, writes damningly about the famous METR AI time horizons graph in the Substack publication Transformer:

It is impossible to draw meaningful conclusions from METR’s Long Tasks benchmark — in particular once one realizes that its numerous flaws are probably compounding in unpredictable ways. The appropriate response to a study of this kind is not to assume it can be saved via back-of-the-envelope adjustments, or to comfort oneself that other anecdotal evidence implies that it is probably correct anyway. It is to cut one’s losses and move on in search of higher-quality information.

… The METR graph cannot be saved. For all its sleekness and complexity, it contains far too many compounding errors to excuse. Among them is generalizing to the entire species data collected from a small group of the authors’ peers. Coming up with ever more dramatic ways to make this mistake has become a kind of sport among AI researchers. If the field has a central pathology, it is to aggressively overindex on a mix of anecdotal data from power-users, alongside a long list of benchmarks even more compromised than METR’s. One hopes that as the field matures, its participants will learn to stop making these mistakes.

The errors include:

Some of the human baseline data is not actually measured or collected from any empirical source; rather, it is just guesstimated by the authors
A key variable in the data is how long it takes humans to complete certain tasks, but — when METR did actually measure this — it paid its human benchmarkers hourly, meaning they were incentivized with cash to take longer
The sample of human benchmarkers was biased toward METR employees’ friends, acquaintances, and former colleagues (who are likely unrepresentative and possibly biased)
Humans familiar with a codebase and a specific coding task were 5-18x faster at completing it, but METR used data from humans who were much slower because they had to spend time familiarizing themselves with the codebase and the task at hand
Test-training data contamination occurred because some of the tasks had published solutions online, which most likely would have been included in LLMs’ training datasets
And many more
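
To get a rough feel for the familiarity point above, here is a minimal arithmetic sketch. It assumes the 5-18x figure can be applied as a simple divisor to a measured completion time; the function name and the 90-minute input are hypothetical, chosen only for illustration, and are not METR's data or method.

```python
def familiar_time_range(measured_minutes, min_speedup=5.0, max_speedup=18.0):
    """Rescale a completion time measured on an unfamiliar human.

    Returns (fastest, slowest) times in minutes for a human who already
    knows the codebase, assuming the 5-18x speedup quoted in the post
    applies as a plain divisor. This is an illustrative assumption.
    """
    if measured_minutes <= 0:
        raise ValueError("measured_minutes must be positive")
    fastest = measured_minutes / max_speedup
    slowest = measured_minutes / min_speedup
    return fastest, slowest


# Hypothetical task: an unfamiliar human needed 90 minutes.
fastest, slowest = familiar_time_range(90.0)
print(f"{fastest:.1f} to {slowest:.1f} minutes")  # 5.0 to 18.0 minutes
```

Under this assumption, a task recorded as taking 90 minutes would correspond to roughly 5 to 18 minutes for a familiar human, which shows how strongly the choice of baseline can move the reported human time for a task.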

Please read the full post. It’s not too long and it’s accessible to a general audience. It’s worth reading in full to see how many errors were made in the creation of the METR graph and just how bad they are.

If you want to read about even more errors in the METR graph not covered in Nathan Witkin’s post, read this post by the AI researchers Gary Marcus and Ernest Davis.

The METR graph is a great example of why scientific standards and best practices are so important, and why enforcing them through processes like peer review is necessary to prevent us from drowning in bad information. It’s extremely dangerous to rely on information that only superficially appears scientific but wasn’t actually produced with the rigour normally required of scientific research.

submitted by /u/common_yarrow
