From VentureBeat · 8 min read

Xiaomi's HarnessX rewrites its own AI scaffolding mid-task — and smaller models gain the most

Our take

Xiaomi's HarnessX introduces a transformative approach to AI agent development, autonomously rewriting its own scaffolding mid-task—a technique that yields particularly impressive gains for smaller models. Addressing a critical engineering bottleneck, HarnessX treats the AI harness as a modular object, enabling dynamic adaptation to application-specific requirements. Practical results demonstrate an average +14.5% performance boost, with the open-weight Qwen3.5-9B model achieving a remarkable +44% improvement on embodied planning tasks, signaling that harness evolution can be a powerful alternative to simply scaling foundation models.

The pursuit of more capable AI agents has largely focused on scaling foundation models: more parameters and more training data. However, as enterprise AI agents tackle increasingly complex, long-horizon tasks, it is becoming clear that the infrastructure surrounding these models, the “harness,” is a critical bottleneck. Today's harnesses are mostly static and hand-crafted. Improving them takes manual work, and they do not automatically improve based on the execution data they collect from their environment. This manual process is time-consuming and error-prone, and it limits what even the most advanced language models can do. The Xiaomi team’s HarnessX offers a compelling alternative, demonstrating that harness evolution can significantly boost agent performance, even for smaller models. That fits a broader trend of optimizing existing resources rather than solely chasing scale, as seen in Alibaba's recent work with Qwen-AgentWorld, in which a model never trained as an agent improved agent performance across seven benchmarks.

HarnessX’s approach, treating the harness as a composable object and evolving it with an automated engine called AEGIS, is genuinely innovative. Adjusting the harness based on execution data, in effect allowing the agent’s scaffolding to learn and adapt, is a significant step forward. The modular design, which breaks agent behavior down into distinct "processors," allows for targeted optimization and avoids the architectural entanglement that plagues many existing systems. It moves away from the traditional "set and forget" approach toward a more iterative and responsive one. The researchers’ emphasis on co-evolution, training the model *and* the harness together, is particularly noteworthy. It highlights how interdependent these components are, a theme also explored in Mindstone’s Rebel system, which is built so that enterprise AI agents automatically remember which model is right for which task.

The results presented are striking, particularly the +44% performance gain observed with the Qwen3.5-9B model. This underscores a crucial point: scaling the foundation model isn’t always the optimal solution. For organizations constrained by compute resources or seeking to maximize the value of existing models, HarnessX offers a practical and potentially more cost-effective path to improved AI performance. The anecdotal examples, the automated fix for browser timeouts and the elimination of pagination loops, further illustrate the system’s ability to address real-world problems that often trip up even sophisticated AI agents. The current reliance on powerful closed models like Claude Opus as the "meta-agent" does introduce a dependency. Open-weight models are improving quickly, but whether they can take over the meta-agent role remains untested.

Ultimately, HarnessX isn’t just a technical innovation; it's a philosophical one. It shifts the focus from solely increasing the size and complexity of the model to optimizing the environment in which it operates, acknowledging that intelligence isn’t solely about the brain but also about the tools and context that surround it. The success of this approach raises the question: as AI agents become increasingly interwoven with our workflows, will we see a surge in research and development focused on intelligent harness engineering, transforming it from a neglected afterthought into a core pillar of AI development?

As enterprise AI agents take on increasingly complex, long-horizon tasks, their performance is often restricted by their harness, the software scaffolding that connects the backbone LLM to its environment. 

Currently, harnesses are mostly static and hand-crafted. Improving them takes manual work, and they do not automatically improve based on the execution data they collect from their environment.

To address this engineering bottleneck, researchers at Xiaomi introduced HarnessX, a framework that treats the AI harness as a composable object and autonomously applies improvements to its code. 

In real-world enterprise applications, this automated adaptation enables AI systems to dynamically adjust to application-specific requirements. Practical tests showed HarnessX delivering substantial performance gains across domains like software engineering and web interaction. 

The results demonstrate that scaling the foundation model is not the only path to more capable AI — and for smaller models, it may not even be the best one. HarnessX's harness evolution yielded an average +14.5% performance gain across 15 model-benchmark combinations; for the open-weight Qwen3.5-9B, gains reached +44% on embodied planning tasks.

The challenges of harness engineering

In AI applications, a foundation model's capability relies heavily on its surrounding harness. The harness acts as the operational layer that converts raw model outputs into structured, executable agent behaviors. It comprises the prompts, external tool integrations, memory management, and control flows that dictate how an AI system observes its environment, reasons through a problem, and takes action. 

As enterprise agents take on more complex, long-horizon workflows, harness engineering has become a fundamental part of AI development. Despite its importance, harness development remains far from a mature engineering discipline and presents three key challenges.

First, harnesses are static and hand-engineered. Any shift in the underlying foundation model, the introduction of new tools, or a pivot to a different operational domain requires bespoke, manual code rewrites. Traditional harnesses lack mechanisms to autonomously learn and improve from past execution experiences.

Second, most existing harnesses suffer from architectural entanglement. They tightly couple prompt templates, tool wrappers, retry policies, and memory management within the same code paths. This entanglement means that tweaking one component can silently break others. Attempting to reuse a harness across different business domains often devolves into raw code copying rather than clean, modular composition.

Third, the harness and foundation model are optimized in isolation. When engineers run tests to improve the harness, the execution traces generated are typically discarded rather than used as training data to improve the model. Consequently, model upgrades do not naturally lead to harness improvements, creating a bottleneck where teams fail to capture the full value of their agent's operational data.

HarnessX: an autonomous foundry for AI agents

HarnessX solves the engineering bottlenecks of manual harness development with what the researchers call a “unified harness foundry.” 

The core innovation of HarnessX is treating the harness as a "first-class object". In software engineering terms, this means the harness is an independently serializable, modular, and substitutable entity. By separating the model configuration (i.e., which AI model is operating) from the harness configuration, engineers can seamlessly swap, adapt, and evolve the scaffolding without touching the underlying model.
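To make the separation concrete, here is a minimal sketch of what an independently serializable harness might look like. It is not HarnessX's actual schema, which the article does not show: the class names `ModelConfig`, `HarnessConfig`, and `Agent`, and every field in them, are assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass(frozen=True)
class ModelConfig:
    """Which backbone model runs the agent (hypothetical fields)."""
    name: str
    temperature: float = 0.0


@dataclass
class HarnessConfig:
    """Scaffolding settings, kept independent of the model (hypothetical fields)."""
    version: int
    system_prompt: str
    tools: list[str] = field(default_factory=list)
    max_steps: int = 30

    def to_json(self) -> str:
        # Serializing the harness on its own lets it be stored, diffed, and swapped.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "HarnessConfig":
        return cls(**json.loads(text))


@dataclass
class Agent:
    model: ModelConfig
    harness: HarnessConfig

    def with_harness(self, harness: HarnessConfig) -> "Agent":
        # Substitute the scaffolding without touching the model configuration.
        return Agent(model=self.model, harness=harness)


v1 = HarnessConfig(version=1, system_prompt="Solve the task.", tools=["browser"])
v2 = HarnessConfig.from_json(v1.to_json())  # round-trip produces an independent copy
v2.version = 2
v2.tools = ["mediawiki_api"]

agent = Agent(ModelConfig(name="Qwen3.5-9B"), v1)
evolved = agent.with_harness(v2)
```

Because the harness round-trips through JSON by itself, a new version can be produced, stored, and attached to the same model without the model configuration changing.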

HarnessX breaks agent behavior down into different components, such as context assembly, memory management, tool ecosystems, control flow, and observability. Every specific behavior is implemented as a "processor" that plugs into precise lifecycle hooks of the harness. This modular structure allows the system to swap, add, or remove these processors without breaking the surrounding pipeline.
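The processor idea can be sketched as a small hook registry. The hook names (`before_model_call`, `after_tool_call`), the `Harness` class, and the two example processors are all hypothetical; the article does not list HarnessX's real lifecycle hooks or processor interface.

```python
from collections import defaultdict
from typing import Callable

# Hook names are hypothetical; the source does not list the actual lifecycle hooks.
HOOKS = ("before_model_call", "after_tool_call")

Processor = Callable[[dict], dict]


class Harness:
    def __init__(self) -> None:
        self._processors: dict[str, list[tuple[str, Processor]]] = defaultdict(list)

    def add(self, hook: str, name: str, processor: Processor) -> None:
        if hook not in HOOKS:
            raise ValueError(f"unknown hook: {hook}")
        self._processors[hook].append((name, processor))

    def remove(self, hook: str, name: str) -> None:
        self._processors[hook] = [
            (n, p) for n, p in self._processors[hook] if n != name
        ]

    def run(self, hook: str, state: dict) -> dict:
        # Each processor receives the state and returns an updated copy.
        for _, processor in self._processors[hook]:
            state = processor(state)
        return state


def add_memory(state: dict) -> dict:
    return {**state, "context": state["context"] + ["[memory] user prefers JSON"]}


def truncate_context(state: dict) -> dict:
    # Keep only the two most recent context entries.
    return {**state, "context": state["context"][-2:]}


harness = Harness()
harness.add("before_model_call", "memory", add_memory)
harness.add("before_model_call", "truncate", truncate_context)

state = {"context": ["task: book a flight", "obs: page loaded"]}
with_memory = harness.run("before_model_call", state)

harness.remove("before_model_call", "memory")
without_memory = harness.run("before_model_call", state)
```

Removing the `memory` processor leaves the `truncate` processor and the rest of the pipeline working unchanged, which is the property the modular design is meant to provide.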

To automate the optimization of this modular structure, HarnessX introduces AEGIS, a trace-driven evolution engine. AEGIS frames harness adaptation as a reinforcement learning (RL) problem over the different symbolic components of the harness. 

Framing harness optimization as a reinforcement learning problem introduces three pathologies the researchers had to explicitly engineer against:

  • Reward hacking: The system might exploit shortcuts to the solution instead of genuinely solving the task.

  • Catastrophic forgetting: An edit that fixes a failure pattern in one domain might silently break a previously solved workflow in another.

  • Under-exploration: The system might iterate on minor prompt tweaks rather than exploring new, structurally superior tool configurations.

To prevent these problems, AEGIS relies on full trace observability and a four-stage pipeline:

  1. Digester: Compresses execution traces into structured summaries to identify where the agent failed.

  2. Planner: Analyzes these summaries to enable the system to explore structural changes rather than just local prompt tweaks.

  3. Evolver: Generates code-level harness edits and tests them to ensure they run correctly before deployment.

  4. Critic and gate: A Critic assesses the edits to detect reward hacking, while a deterministic gate rejects any update that regresses a previously solved task to prevent catastrophic forgetting.
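Of the four stages, the deterministic gate is the simplest to illustrate. The sketch below assumes each task has a pass/fail result and treats a task missing from the new results as a regression. The function name `gate` and the data shapes are assumptions, not the researchers' implementation.

```python
def gate(solved_before: set[str], results_after: dict[str, bool]) -> tuple[bool, list[str]]:
    """Accept a candidate harness only if every previously solved task still passes.

    A task that is absent from `results_after` counts as a regression
    (an assumption made for this sketch).
    """
    regressions = sorted(
        task for task in solved_before if not results_after.get(task, False)
    )
    return (not regressions, regressions)


solved = {"task-1", "task-2"}

# Candidate A keeps both old tasks passing and also solves a new one.
accepted_a, regressed_a = gate(solved, {"task-1": True, "task-2": True, "task-3": True})

# Candidate B solves the new task but breaks task-2, so it is rejected.
accepted_b, regressed_b = gate(solved, {"task-1": True, "task-2": False, "task-3": True})
```

The check involves no model call, so the same inputs always produce the same decision. A rule of this kind is what lets a gate guard against catastrophic forgetting regardless of how persuasive a proposed edit looks.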

HarnessX enters a growing field of self-improving harness research — but what separates it is harness-model co-evolution.

The researchers highlight that optimizing either component in isolation eventually hits a wall. Evolving only the harness hits a scaffolding ceiling if the underlying model lacks the reasoning capacity to use the new tools. Training only the model hits a training-signal ceiling if the harness never prompts the model to use its advanced capabilities.

HarnessX interleaves harness evolution with model training. The execution traces generated while the harness attempts to adapt to tasks are converted into reinforcement learning signals for the foundation model. Every time the harness improves its strategy, the model simultaneously learns to better exploit that new strategy, breaking the capability ceilings of traditional AI agent development.

HarnessX makes this co-evolution possible through cross-harness GRPO (Group Relative Policy Optimization). GRPO is the popular RL algorithm used to train reasoning models such as DeepSeek-R1. 

When fine-tuning the model, cross-harness GRPO pools an agent's execution trajectories for the same task across entirely different versions of the application's harnesses. This allows the underlying model to internalize high-level strategy shifts, like using a new API endpoint or managing an execution budget, rather than just learning minor prompt-phrasing variations.
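The pooling step can be sketched with the standard GRPO group-relative advantage, where each trajectory's reward is normalized against the mean and standard deviation of its group. The article does not give the exact formula cross-harness GRPO uses, so treat this as an assumption: the only change from ordinary GRPO here is that the group for a task contains trajectories from several harness versions. The function name and trajectory fields are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(trajectories: list[dict], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against all trajectories for the same task.

    Grouping is by task only, so trajectories from different harness
    versions land in the same group (the "cross-harness" part).
    """
    by_task = defaultdict(list)
    for traj in trajectories:
        by_task[traj["task"]].append(traj["reward"])

    advantages = []
    for traj in trajectories:
        rewards = by_task[traj["task"]]
        mu, sigma = mean(rewards), pstdev(rewards)
        advantages.append((traj["reward"] - mu) / (sigma + eps))
    return advantages


trajs = [
    {"task": "lookup", "harness_version": 1, "reward": 0.0},  # browser tool timed out
    {"task": "lookup", "harness_version": 1, "reward": 0.0},
    {"task": "lookup", "harness_version": 2, "reward": 1.0},  # new API tool succeeded
    {"task": "lookup", "harness_version": 2, "reward": 1.0},
]
adv = group_relative_advantages(trajs)
```

If the version 1 trajectories were grouped on their own, every reward in the group would be identical and the advantages would all be zero. Pooling across versions gives the successful strategy a positive advantage and the failed one a negative advantage, which is the kind of signal the article says lets the model internalize strategy shifts.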

HarnessX in action on industry benchmarks

To validate the practical utility of HarnessX, the researchers tested it across five benchmarks comprising software engineering, multi-turn customer service dialog, web navigation, open-ended multi-step reasoning, and embodied planning.

They separated the AI into two roles. The “meta-agent,” powered by Claude Opus 4.6, analyzed logs and wrote the code to evolve the harnesses. The “task agents” ran the actual workflows. To prove the framework is model-agnostic, they tested it on three different worker models: Claude Sonnet 4.6, GPT-5.4, and the open-weight Qwen3.5-9B.

HarnessX was compared against two primary baselines. The first was a static harness, representing how most enterprises deploy AI today: hand-crafted, frozen setups with benchmark-specific prompts and tools. The second was the Claude Code SDK, a single-agent evolver baseline used to test whether the more complex four-stage AEGIS pipeline outperformed simply asking a single language model to iterate on the code.

Dynamically evolving the harness yields significant gains on the same base model. HarnessX improved performance in 14 out of 15 model-benchmark combinations. Across all tests, evolving the harness yielded an average absolute performance gain of +14.5%.

The weakest models benefited the most from dynamic harness improvement. The open-weight Qwen3.5-9B saw a +44.0% performance jump on the ALFWorld embodied planning benchmark, and an +18.2% jump on SWE-bench Verified for software engineering. 

Co-evolution also proved highly effective. When the researchers trained the foundation model using the data generated while evolving the harness, they saw an additional +4.7% average performance boost. Improving the harness and the model simultaneously yields the highest ceiling. The co-evolution gain applies only to open-weight models, since training the model requires access to its weights.

Anecdotal evidence from the experiments shows how HarnessX solves pernicious problems when creating agent harnesses for real-world tasks. For example, in the GAIA multi-step reasoning benchmark, the task agent consistently failed because the headless browser tool it used to scrape Wikipedia timed out on the site's JavaScript-heavy frontend. HarnessX analyzed the execution traces, diagnosed the error, and wrote a new tool that bypassed the browser entirely and queried the MediaWiki API directly for plain text. It swapped this tool into the harness and instantly unlocked the failing tasks.
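A tool of this kind can be quite small. The sketch below is not the code HarnessX generated, which the article does not show; it is one plausible version that uses the public MediaWiki Action API with `prop=extracts` and `explaintext` to get plain text. The function names and the `User-Agent` string are assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title: str) -> str:
    """Build a MediaWiki Action API URL that returns a page as plain text."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # plain text instead of HTML
        "redirects": 1,    # follow page redirects
        "format": "json",
        "titles": title,
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_extract(payload: dict) -> str:
    """Pull the text out of the API response; return '' if the page has none."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return ""


def fetch_plain_text(title: str, timeout: float = 10.0) -> str:
    """Network call: no headless browser and no JavaScript rendering involved."""
    request = Request(
        build_extract_url(title),
        headers={"User-Agent": "harness-sketch/0.1"},  # hypothetical identifier
    )
    with urlopen(request, timeout=timeout) as response:
        return parse_extract(json.load(response))


url = build_extract_url("Ada Lovelace")
sample = {"query": {"pages": {"974": {"title": "Ada Lovelace", "extract": "Ada Lovelace was..."}}}}
text = parse_extract(sample)
```

Here `sample` is a hand-written stand-in for an API response, so the parsing logic can be checked without a network call.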

During the WebShop e-commerce tests, the AI agent often got stuck in pagination loops, endlessly clicking "next page" and reformulating searches without ever committing to buying a product. Rather than just tweaking the prompt, HarnessX built an advisory processor that detected when the agent was repeating navigation actions. It injected a warning into the context to force a decision, curing the looping behavior and raising performance.
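An advisory processor along these lines might look like the sketch below. The action names, the window size, and the wording of the warning are all assumptions; the article describes the behavior but not the code.

```python
NAVIGATION_ACTIONS = {"next_page", "search"}  # hypothetical action names


def loop_advisory(state: dict, window: int = 4) -> dict:
    """Append a warning to the context when the last `window` actions were all navigation."""
    recent = state["actions"][-window:]
    stuck = len(recent) == window and all(a in NAVIGATION_ACTIONS for a in recent)
    if not stuck:
        return state
    warning = (
        f"Advisory: your last {window} actions were all navigation. "
        "Pick the best product seen so far and complete the purchase."
    )
    return {**state, "context": state["context"] + [warning]}


looping = {
    "actions": ["search", "next_page", "next_page", "search", "next_page"],
    "context": ["task: buy a red mug under $15"],
}
progressing = {
    "actions": ["search", "next_page", "click_item", "next_page"],
    "context": ["task: buy a red mug under $15"],
}

after_looping = loop_advisory(looping)
after_progressing = loop_advisory(progressing)
```

The processor only adds text to the context and never blocks an action, which matches the "advisory" framing: the model still makes the decision, but with the loop pointed out to it.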

Limits of automated harness engineering

One important caveat is that the system currently relies on powerful models to act as the meta-agent that rewrites the harness code. In their experiments, the researchers relied on closed frontier models like Claude Opus. Open-weight models are quickly improving, but their ability to serve as the meta-agent remains untested.

Another limitation is the intrinsic capability of the task model. If the underlying task model is fundamentally too weak to execute the complex workflows the new harness proposes, HarnessX cannot improve the agent’s overall abilities (the researchers observed this with the Qwen3.5-9B model on the SWE-bench coding tests).

Despite these limitations, HarnessX makes a concrete case that harness engineering — not just model scaling — is a lever practitioners can pull now. For teams running smaller open-weight models on complex workflows, the gains here are large enough to justify evaluating harness evolution as a first step before reaching for a more expensive frontier model. The researchers plan to release the code in a future update.
