7 min read · from VentureBeat

Frontier AI models don't just delete document content — they rewrite it, and the errors are nearly impossible to catch

Our take

As AI language models evolve, their ability to rewrite document content raises critical concerns about reliability. A new study from Microsoft reveals that even leading models can introduce significant errors, degrading an average of 25% of document content during complex, multi-step workflows. This highlights the need for caution when delegating knowledge tasks to AI.

The recent study by Microsoft researchers highlights a pressing concern in the evolving landscape of AI and document management: the reliability of large language models (LLMs) during multi-step delegated workflows. As users increasingly hand knowledge tasks to these models, the findings raise critical questions about the integrity of the content produced. With even frontier LLMs corrupting an average of 25% of document content during iterative processes, the implications for businesses and organizations are significant, particularly as they seek to streamline operations with AI integration.

At the heart of the Microsoft study is the concept of "delegated work," a paradigm that allows users to trust LLMs to manage complex document tasks on their behalf. This trust is crucial, as many users may lack the time or expertise to meticulously review each modification made by AI. However, the study reveals a stark reality: these models can introduce not only errors but also distortions that are difficult to detect. The DELEGATE-52 benchmark developed by the researchers offers a unique approach to assessing model performance in real-world scenarios, emphasizing the need for accountability and transparency in AI-driven document handling. As organizations adopt AI more widely, this understanding is essential to mitigate risks associated with content degradation, especially in critical domains like finance and legal documentation.

One of the most significant takeaways from the study is the recommendation for organizations to implement incremental human review during long-horizon workflows. This practice counters the tendency of models to maintain a facade of accuracy for several steps before facing catastrophic failures, which could result in substantial losses or miscommunications. This nuanced perspective contrasts sharply with the prevailing narrative that paints a picture of fully autonomous AI agents as a near-term reality. Instead, the findings suggest that organizations must remain vigilant and proactive, incorporating robust review mechanisms into their AI workflows. This need for oversight reinforces the idea that while automation can enhance productivity, it is not a panacea.

Moreover, the study highlights the risks of relying on generic tools for AI operations, as seen when models given agentic harnesses suffered increased degradation. This insight points to an opportunity for developers to create domain-specific tools that improve the reliability of AI applications. Model performance varies widely across domains, with notable success in programmatic tasks and severe struggles in natural-language and niche domains. This variability underscores the need for tailored solutions that align with specific organizational needs and workflows.

As we look ahead, the implications of this research extend beyond merely understanding the limitations of current AI models. Organizations must weigh the benefits of automation against the potential for error and misinformation. The question that remains is how quickly and effectively AI technology can evolve to meet these challenges. With models improving at a rapid pace, as noted by Philippe Laban, it is conceivable that future iterations will achieve higher reliability scores. However, will this advancement be enough to allay the concerns of organizations hesitant to fully embrace autonomous workflows? The balance between leveraging AI's potential and ensuring content integrity will undoubtedly shape the future of document management and organizational efficiency. The journey towards reliable AI is just beginning, and its trajectory will be worth watching closely.

As large language models become more capable, users are tempted to delegate knowledge tasks where models process documents on their behalf and provide the finished results. But how far can you trust the model to stay faithful to the content of your documents when it has to iterate over them across multiple rounds?

A new study by researchers at Microsoft shows that large language models silently corrupt documents that they work on by introducing errors. The researchers developed a benchmark that simulates multi-step autonomous workflows across 52 professional domains, using a method that automatically measures how much content degrades over time.

Their findings show that even top-tier frontier models corrupt an average of 25% of document content by the end of these workflows. And providing models with agentic tools or realistic distractor documents actually worsens their performance.

This serves as a warning that while there is increasing pressure to automate knowledge work, current language models are not fully reliable for these tasks.

The mechanics of delegated work

The Microsoft study focuses on “delegated work,” an emerging paradigm where users allow LLMs to complete knowledge tasks on their behalf by analyzing and modifying documents.

A prominent example of this paradigm is vibe coding, where a user delegates software development and code editing to an AI. But delegated workflows extend far beyond programming into other domains. In accounting, for example, a user might supply a dense ledger and instruct the model to split the document into separate files organized by specific expense categories.

Because users might lack the time or the specialized expertise to manually review every modification the AI implements, delegation often hinges on trust. Users expect that the model will faithfully complete tasks without introducing unchecked errors, unauthorized deletions, or hallucinations in the documents.

To measure how far AI systems can be trusted in extended, iterative delegated workflows, the researchers developed the DELEGATE-52 benchmark. The benchmark is composed of 310 work environments spanning 52 diverse professional domains, including financial accounting, software engineering, crystallography, and music notation.

Each work environment relies on real-world seed text documents ranging from 2,000 to 5,000 tokens. Alongside the seed document, the environments include five to ten complex, non-trivial editing tasks.

Grading a complex, multi-step editing process usually requires expensive human review. DELEGATE-52 bypasses this by using a “round-trip relay” simulation method that evaluates answers without requiring human-annotated reference solutions. The approach is inspired by the backtranslation technique used in machine translation evaluation, where a model translates a document into another language and back, and the output is scored by how faithfully it reproduces the original.

Accordingly, every edit task in DELEGATE-52 is designed to be fully reversible, pairing a forward instruction with its precise inverse. For example, an instruction to split the ledger into separate files by expense category is paired with an instruction to merge all category files back into a single ledger.
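The round-trip idea is simple enough to sketch in a few lines of Python. Everything below is illustrative: the task runner, the toy "model," and the character-level similarity are stand-ins, not the paper's actual harness.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Character-level similarity; a crude stand-in for the benchmark's
    # domain-specific parsers and comparison functions.
    return SequenceMatcher(None, a, b).ratio()

def round_trip_score(apply_task, seed_doc, forward, inverse):
    # Run the forward edit, then its inverse in a fresh call, and score
    # how much of the original document survives the round trip.
    edited = apply_task(forward, seed_doc)
    restored = apply_task(inverse, edited)
    return similarity(seed_doc, restored)

# Toy "model": uppercases on the forward task, lowercases on the inverse.
def toy_model(task, doc):
    return doc.upper() if task == "upper" else doc.lower()

score = round_trip_score(toy_model, "travel: 120\nmeals: 45", "upper", "lower")
# A perfectly faithful round trip scores 1.0; dropped or distorted content lowers it.
```

A real evaluation would replace the toy model with LLM calls and the character-level ratio with the benchmark's domain-aware comparison, but the scoring logic is the same shape.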

In comments provided to VentureBeat, Philippe Laban, Senior Researcher at Microsoft Research and co-author of the paper, clarified that this is not simply a test of whether an AI can hit "undo." Because human workers cannot be forced to instantly "forget" a task they just did, this round-trip evaluation is uniquely suited for AI. By starting a new conversational session, the researchers force the model to attempt the inverse task completely independently.

The models in their experiments “do not know whether a task is a forward or backward step and are unaware of the overall experiment design," Laban explained. "They are simply attempting each task as thoroughly as they can at each step."

These roundtrip tasks are chained together into a continuous relay to simulate long-horizon workflows spanning 20 consecutive interactions. To make the environment more realistic, the benchmark introduces distractor files in the context of each task. These contain 8,000 to 12,000 tokens of topically related but completely irrelevant documents. Distractors measure whether the AI can maintain focus or if it gets confused and pulls in the wrong data.
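A minimal sketch of how such a relay might be simulated, with hypothetical names (run_relay, degradation) that are not from the paper, and character-level similarity standing in for the benchmark's domain metrics:

```python
from difflib import SequenceMatcher

def degradation(seed: str, current: str) -> float:
    # Fraction of the seed document's content that has been lost or altered,
    # relative to the original, using character-level similarity as a proxy.
    return 1.0 - SequenceMatcher(None, seed, current).ratio()

def run_relay(apply_task, seed_doc, task_pairs, rounds=20):
    # Chain forward/inverse task pairs into a long-horizon relay and record
    # how far the document has drifted from the seed after each interaction.
    doc, trace = seed_doc, []
    steps = [step for pair in task_pairs for step in pair]
    for i in range(rounds):
        doc = apply_task(steps[i % len(steps)], doc)
        trace.append(degradation(seed_doc, doc))
    return trace

# A perfectly faithful toy "model" shows zero drift across all 20 rounds;
# the study's finding is that real models accumulate degradation instead.
faithful = lambda task, doc: doc
trace = run_relay(faithful, "q1 revenue: 10\nq2 revenue: 12", [("split", "merge")])
```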

Testing frontier models in the relay

To understand how different architectures and scales handle delegated work, the researchers tested 19 different language models from OpenAI, Anthropic, Google, Mistral, xAI, and Moonshot. The main experiment subjected these models to a simulation of 20 consecutive editing interactions.

Across all models, documents suffered an average degradation of 50% by the end of the simulation. Even the best frontier models in the experiment, specifically Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, corrupted an average of 25% of the document content.

Out of 52 professional domains, Python was the only one where most models achieved a ready status with a score of 98% or higher. Models excel in programmatic tasks but struggle severely in natural language and niche domains like fiction, earnings statements, or recipes. The overall top model, Gemini 3.1 Pro, was deemed ready for delegated work in only 11 of the 52 domains.

Interestingly, the corruption was not a death by a thousand cuts, where models slowly accumulate tiny errors. Instead, about 80% of total degradation came from sparse but massive critical failures: single interactions in which a model suddenly drops at least 10% of the document's content. Frontier models do not necessarily avoid small errors better; they simply delay these catastrophic failures to later rounds.
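That "sparse but massive" pattern is easy to flag once per-step measurements exist. A hypothetical sketch, using document length as a crude proxy for content (the paper measures actual content degradation, not size):

```python
def critical_failures(sizes, threshold=0.10):
    # Flag interactions where the document loses at least `threshold` of its
    # content in a single step: the study's "critical failure" pattern.
    flagged = []
    for i in range(1, len(sizes)):
        drop = (sizes[i - 1] - sizes[i]) / sizes[i - 1]
        if drop >= threshold:
            flagged.append(i)
    return flagged

# A document that holds steady for several rounds, then collapses at step 5.
sizes = [4000, 3990, 3985, 3980, 3970, 3100, 3095]
critical_failures(sizes)  # → [5]
```

A monitor like this is one way to trigger the incremental human review the researchers recommend, rather than waiting for a final check.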

Another important observation is that when weaker models fail, their degradation originates primarily from content deletion. However, when frontier models fail, they actively corrupt the existing content. The text is still there, but it has been subtly distorted or hallucinated, making it much harder for a human overseer to detect the error.

Giving models an agentic harness with generic tools for code execution and file read/write access actually worsened their performance, adding an average of 6% more degradation. Laban explained that the failure lies in relying on generic tools rather than domain-specific ones.

"Models lack the capability to write effective programs on the fly that can manipulate files across diverse domains without mistakes," he noted. "When they cannot do something programmatically, they resort to reading and rewriting entire files, which is less efficient and more error prone." The solution for developers is to build tightly scoped tools (such as specific functions to calculate or move entries within .ledger files) to keep agents on track.

Degradation also snowballs as documents get larger or as more distractor files are added to the workspace. For enterprise teams investing heavily in retrieval-augmented generation (RAG), these distractor documents serve as a direct warning about the compounding cost of messy context. While a noisy context window might cause a minimal 1% performance drop after just two interactions, that degradation compounds to a massive 2-8% drop over a long simulation.

"For the retrieval community: RAG pipelines should be evaluated over multi-step workflows, not just single-turn retrieval benchmarks," Laban said. "Single-turn measurements systematically underestimate the harm of imprecise retrieval."

Reality check for the autonomous enterprise

The findings from the DELEGATE-52 benchmark offer a critical reality check for the current hype surrounding fully autonomous AI agents.

The benchmark's design also implies a practical constraint: because models can maintain a clean record for several steps before a sudden catastrophic failure, incremental human review is necessary, not a single final check. Laban recommends building AI applications around short, transparent tasks rather than complex long-horizon agents.

For organizations wanting to deploy autonomous agents safely today, the DELEGATE-52 methodology provides a practical blueprint for testing in-house data pipelines. Laban explained that "… an enterprise team wanting to adopt this framework needs to build three components: (a) a set of reversible editing tasks representative of their workflows, (b) a parser that converts their domain documents into a structured representation, and (c) a similarity function that compares two parsed representations." Teams do not even need to build parsers from scratch. The Microsoft research team successfully repurposed existing parsing libraries for 30 out of the 52 domains tested.
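Laban's three components can be sketched concretely. The following is a toy illustration for a simple "category: amount" ledger format; none of these names or formats come from the paper.

```python
# (a) A reversible editing task pair representative of the workflow.
FORWARD = "Split the ledger into one file per expense category."
INVERSE = "Merge all category files back into a single ledger."

def parse(doc: str) -> dict:
    # (b) Parser: convert a domain document into a structured representation.
    # Here, each "category: amount" line becomes one dict entry.
    entries = {}
    for line in doc.strip().splitlines():
        category, amount = line.split(":")
        entries[category.strip()] = float(amount)
    return entries

def similarity(parsed_a: dict, parsed_b: dict) -> float:
    # (c) Similarity function: compare two parsed representations by the
    # fraction of original entries that survive unchanged.
    if not parsed_a:
        return 1.0
    kept = sum(1 for k, v in parsed_a.items() if parsed_b.get(k) == v)
    return kept / len(parsed_a)

original = parse("travel: 120\nmeals: 45\nsoftware: 300")
restored = parse("travel: 120\nmeals: 45")  # the model silently dropped a line
similarity(original, restored)  # → 2/3
```

A real deployment would swap the toy parser for an existing domain library and a richer similarity metric, which is where the team's reuse of off-the-shelf parsers pays off.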

Laban is optimistic about the rate of improvement. "Progress is real and fast. Looking at the GPT family alone, models go from scoring below 20% to around 70% in 18 months," Laban said. "If that trajectory continues, models will soon be able to achieve saturated scores on DELEGATE-52."

However, Laban cautioned that DELEGATE-52 is purposefully small compared to massive enterprise environments. Even as foundation models inevitably master this benchmark, the endless long-tail of unique enterprise data and workflows means organizations will always need to invest in custom, domain-specific tooling to keep their autonomous agents reliable.
