June 18, 2026•9 min read•from VentureBeat

New AI optimization framework beats Claude Code and Codex by 2.5x on the same compute budget

Our take

Engineering teams face a persistent challenge: AI agents that succeed in development often hallucinate or miss critical constraints once they reach production. Fixing this takes tedious trial and error, and it is hard to pinpoint which adjustments actually helped. Arbor, a new AI optimization framework developed by researchers at Renmin University of China and Microsoft Research, reportedly delivers more than 2.5 times the verifiable performance gains of standard AI coding agents like Claude Code and Codex on the same compute budget.


Arbor, a framework that reports substantial gains over established coding agents like Claude Code and Codex, marks a notable step for autonomous AI optimization. Imagine your engineering team just deployed an AI agent to search through internal company documents and answer employee questions. It works perfectly in development, but in production, it consistently hallucinates or misses key constraints. Fixing this is rarely a simple patch. Arbor's answer is to move beyond the brute-force approach of throwing more compute at the problem, a tactic that often yields diminishing returns. The timing matters: enterprises increasingly rely on AI agents and need efficient ways to manage and refine their performance. Anthropic's Claude Code Artifacts update, which brings live, shared dashboards and interactive workspaces to enterprises, reflects the same push to improve AI agent workflows. Separately, Amazon's hope of challenging Nvidia more directly by selling its AI chips signals a broader shift toward accessible and scalable AI infrastructure.

The core innovation of Arbor lies in its structured approach to experimentation, moving away from the chaotic trial-and-error process that plagues many AI optimization efforts. The framework cleverly organizes the process as a "Hypothesis Tree," allowing the system to learn from previous failures and build upon successes in a cumulative fashion. This is a critical departure from existing agent architectures that treat each attempt in isolation, effectively erasing valuable insights. The “coordinator” and “executor” structure, where the coordinator charts the course while executors implement specific hypotheses in isolated environments, is particularly compelling. It neatly addresses the problem of entangled changes, making it possible to pinpoint exactly which adjustments contribute to improvements—a stark contrast to the frustrating ambiguity often encountered when tweaking prompts, retrieval methods, or chunking strategies. This level of attribution is not merely a convenience; it’s a fundamental prerequisite for reliable and scalable AI system refinement.

The reported performance gains – over 2.5 times the verifiable performance improvement compared to Claude Code and Codex on the same compute budget – are striking. This isn’t just about achieving slightly better results; it’s about dramatically increasing the efficiency of the optimization process. The framework's resilience against overfitting, demonstrated by its superior performance on held-out data, further solidifies its potential for real-world application. Focusing on “loop engineering,” as championed by figures like Peter Steinberger, and moving beyond simple prompts toward iterative cycles that drive autonomous agents, appears to be a vital step toward more robust and adaptable AI systems. Arbor’s ability to generalize learned optimizations across different tasks, as evidenced by its performance on unseen search-agent challenges, suggests a level of intelligence and adaptability that goes beyond mere task-specific tuning. The framework’s design, which allows for seamless integration with existing Git workflows, further reduces the barrier to adoption for engineering teams already comfortable with version control practices.

Looking ahead, the potential for Arbor to evolve into a broader platform for AI system development is significant. The researchers' vision of extending the framework to handle multi-objective optimization, where nodes represent vectors of metrics rather than single scores, is particularly exciting. This capability would enable more nuanced and sophisticated AI systems that can balance competing priorities, such as accuracy, latency, and cost. Another crucial question is how Arbor’s principles can be applied to other domains beyond code optimization, such as drug discovery or materials science, where iterative experimentation is essential. Will we see similar tree-based frameworks emerge to tackle the complexities of autonomous exploration in other scientific fields, or will Arbor’s approach prove to be a uniquely valuable solution for AI-driven software engineering?

Imagine your engineering team just deployed an AI agent to search through internal company documents and answer employee questions. It works perfectly in development, but in production, it consistently hallucinates or misses key constraints. Fixing this is rarely a simple patch. It requires a tedious, trial-and-error process of tweaking chunking strategies, retrieval methods, and system prompts simultaneously. Because these adjustments are entangled, it becomes nearly impossible to attribute which specific tweak actually solved the problem.

To address this challenge, researchers at Renmin University of China and Microsoft Research introduced Arbor, a framework that upgrades AI-driven research and optimization from a sequence of trial-and-error guesses into a cumulative learning process. Arbor organizes hypotheses, experiments, and insights into a tree that helps the system learn from prior failures to make smarter, verified improvements over time.

In practical tests, Arbor delivered more than 2.5 times the verifiable performance gains of standard AI coding agents across real-world engineering tasks while operating under the same resource budget.

For enterprise AI, this technique directly translates to automating the continuous improvement of complex, real-world engineering systems.

Understanding the bottleneck in autonomous optimization

As large language models and AI systems become more capable, they are expected to take on more complex work, including autonomous optimization (AO) of software systems such as agent harnesses or model training algorithms.

AO captures the fundamental loop of autonomous research. An AI agent starts with an initial mutable artifact, such as a machine learning codebase or data pipeline, and a specific objective. The agent's goal is to iteratively improve this artifact through experimental feedback without step-by-step human supervision.

The main challenge of AO is often misunderstood. Many engineering teams find that simply giving a coding agent more time or compute to optimize a codebase doesn't lead to better results. "Automation can keep an AI working for a very long time — but a loop is not the same as progress," Jiajie Jin, co-author of the paper, told VentureBeat. "If the goal is vague, or the metric is easy to hack, long-running automation often just produces 'improvements' faster that nobody actually wants."

Jin explains that complex tasks take many attempts to get right, and standard agent architectures are missing the critical data structure to maintain state. "How do you make sure the insight and experience from each attempt actually accumulate, instead of getting lost in a scrollback buffer?" he said. Without this structure, agents simply repeat the same mistakes.

Current agent systems can run experiments for many hours against well-specified goals: editing code, invoking tools, running tests autonomously. But they treat each attempt in isolation, missing the structural mechanisms that would let them accumulate and act on what they've learned.

They lack the capacity to simultaneously maintain and compare multiple competing research directions. Without this, they cannot interpret both successes and failures to reshape their future exploration, which is the core mechanism that makes human research cumulative.

General coding agents typically rely on conversation transcripts for their memory. Because AO tasks span hundreds of turns and easily exceed context window limits, these agents struggle to preserve and reuse factual evidence over long histories. As a result, they lose the overarching structure of the research process and are prone to stalling on early failures or chasing noisy evaluation swings. The system needs a structured, durable memory that records what directions have been tried, what factual evidence was produced, and how each result changes the space of future hypotheses.

Existing frameworks are also prone to reward hacking and overfitting to development metrics. This makes them create the illusion of progress without producing improvements that transfer to real-world performance.

Finally, general-purpose coding agents typically chain their tool calls on a single shared working tree. This architectural limitation prevents them from testing parallel hypotheses in isolated environments without corrupting the main codebase or obscuring which hypothesis caused a specific outcome.

The Arbor framework

Arbor solves the challenges of AO with a framework that automates the long-horizon loop of exploration, experimentation, and abstraction that characterizes human research. Arbor separates the strategic direction of research from the ground-level coding tasks with two key components:

The coordinator: A long-lived AI agent that acts like a principal investigator. It never directly edits the target codebase. Instead, it owns the general state of the optimization research, observes accumulated evidence, comes up with new hypotheses and directions to explore, and decides what to do with the results of experiments.

Executors: Short-lived, highly focused AI agents. When the coordinator wants to test an idea, it spins up an executor and places it in an isolated environment, essentially a fresh git worktree. Each executor is handed one hypothesis. It implements the assigned idea, runs evaluations, debugs errors, and reports back to the coordinator with the results and created artifacts.

These two components collaborate through a mechanism that the researchers call “Hypothesis Tree Refinement” (HTR). HTR represents the entire research process as a persistent, branching tree where every node binds together four things: a hypothesis, the executable artifact, the factual evidence produced, and a distilled insight. This means the coordinator can explore multiple competing directions at the same time without losing its place.
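The four-part node described above can be pictured as a small data structure. The following is a minimal sketch, not Arbor's actual implementation: the class name `HypothesisNode` and every field name are hypothetical, chosen only to mirror the hypothesis, artifact, evidence, and insight that each node is said to bind together.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity-based equality avoids recursing through parent/children
class HypothesisNode:
    """One node of a hypothesis tree (all names here are hypothetical)."""
    hypothesis: str                                 # the idea being tested
    artifact_ref: Optional[str] = None              # e.g. a git branch holding the code
    evidence: dict = field(default_factory=dict)    # measured facts, such as scores
    insight: Optional[str] = None                   # distilled lesson from the experiment
    parent: Optional["HypothesisNode"] = field(default=None, repr=False)
    children: list = field(default_factory=list)

    def add_child(self, hypothesis: str) -> "HypothesisNode":
        """Attach a more concrete refinement below this node."""
        child = HypothesisNode(hypothesis=hypothesis, parent=self)
        self.children.append(child)
        return child

    def depth(self) -> int:
        """Number of edges between this node and the root."""
        depth, node = 0, self
        while node.parent is not None:
            depth += 1
            node = node.parent
        return depth


# Broad ideas sit near the root; concrete refinements branch out as leaves.
root = HypothesisNode("Improve RAG answer accuracy")
retrieval = root.add_child("Change the retrieval method")
leaf = retrieval.add_child("Decompose constraints before retrieval")
print(leaf.depth())
```

Keeping the artifact as a reference (a branch name, say) rather than as code inside the node is a design assumption of this sketch; it keeps the tree small enough to persist across a long run.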

The coordinator builds the tree by placing broad ideas near the root, while concrete refinements branch out as leaves. This allows Arbor to safely explore multiple competing hypotheses simultaneously. If an executor's experiment fails, the tree records why it failed as a negative constraint, ensuring the system doesn't endlessly repeat the same mistake.

To understand why Arbor's isolation matters, consider a common enterprise scenario: optimizing a Retrieval-Augmented Generation (RAG) pipeline for an internal AI assistant. "When you ask a single agent like Claude Code or Codex to 'improve accuracy,' it will typically change a bunch of things in one pass — chunking, the prompt, the retrieval method," Jin said. This entangles the changes, making it impossible to attribute which one actually helped. It also directly mutates the repository without isolation.

Arbor solves this by treating each lever as a separate hypothesis. Chunking becomes one branch, retrieval another, and the prompt another — each implemented and evaluated in its own isolated git worktree. "So you get clean attribution: 'constraint decomposition on the retrieval side gave +X; breadth-first search actually hurt,'" Jin said.
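As a rough illustration of one worktree per hypothesis, the sketch below builds the `git worktree add` command for each lever. It is an assumption-laden example, not Arbor's code: the `.worktrees` directory, the `hyp/` branch prefix, and both function names are hypothetical. Only `git worktree add -b <branch> <path> <commit-ish>` and `git -C <dir>` are standard git usage.

```python
import subprocess
from pathlib import Path


def worktree_command(repo_root, name, base="HEAD"):
    """Build the git command that creates an isolated worktree for one hypothesis.

    The '.worktrees' folder and 'hyp/' branch prefix are hypothetical conventions.
    """
    repo = Path(repo_root).resolve()
    path = repo / ".worktrees" / name
    return ["git", "-C", str(repo), "worktree", "add",
            "-b", f"hyp/{name}", str(path), base]


def create_worktree(repo_root, name, base="HEAD"):
    """Actually create the worktree; raises CalledProcessError if git fails."""
    subprocess.run(worktree_command(repo_root, name, base), check=True)


# One branch and one working directory per lever, so changes never mix.
for lever in ["chunking", "retrieval", "prompt"]:
    print(" ".join(worktree_command(".", lever)))
```

Because each hypothesis lives on its own branch and in its own directory, a score measured there can be attributed to that one change.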

When an executor returns a report, the coordinator writes the evidence to the tree and backpropagates the insight upward to parent nodes. This means a local observation becomes a generalized constraint that shapes the coordinator's future idea generation.
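The upward flow of insights can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Node` class and both functions are hypothetical, and the insight is copied verbatim to each ancestor, whereas the article describes the coordinator turning a local observation into a generalized constraint.

```python
class Node:
    """Minimal tree node for illustration (hypothetical, not Arbor's class)."""

    def __init__(self, hypothesis, parent=None):
        self.hypothesis = hypothesis
        self.parent = parent
        self.insights = []  # lessons recorded at or below this node


def backpropagate(node, insight):
    """Record an insight on `node` and on every ancestor up to the root."""
    current = node
    while current is not None:
        current.insights.append(insight)
        current = current.parent


def negative_constraints(node):
    """Failures visible from this node: directions not worth repeating."""
    return [i["text"] for i in node.insights if i["kind"] == "negative"]


root = Node("Improve search-agent accuracy")
retrieval = Node("Change retrieval strategy", parent=root)
bfs = Node("Use breadth-first search", parent=retrieval)
prompt = Node("Rewrite the system prompt", parent=root)

# A failed experiment is stored as a negative constraint, not discarded.
backpropagate(bfs, {"kind": "negative", "text": "breadth-first search hurt accuracy"})
print(negative_constraints(root))
```

The sibling `prompt` branch is untouched, while the root now carries the lesson and can steer future hypotheses away from the failed direction.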

To prevent reward hacking or overfitting to the development data, HTR enforces a strict “merge gate.” Even if an executor reports a fantastic development score, the coordinator will spin up an isolated worktree to test the candidate against a held-out test evaluator. The artifact is only merged into the current best trunk if it demonstrably improves the test score, verifying that the progress is real.
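The merge gate's decision rule is simple enough to sketch. The code below is an illustration under assumptions: the function name, the candidate dictionary, and all scores are made up for the example and are not taken from the paper. What it preserves from the description is that the development score plays no role in the decision.

```python
from typing import Callable, Tuple


def merge_gate(trunk_test_score: float,
               candidate: dict,
               held_out_eval: Callable[[dict], float]) -> Tuple[bool, float]:
    """Decide whether a candidate is merged into the per-run trunk.

    The development score reported by the executor is deliberately ignored;
    only a fresh held-out evaluation can move the trunk forward.
    """
    test_score = held_out_eval(candidate)
    if test_score > trunk_test_score:
        return True, test_score      # merge: this becomes the new best trunk score
    return False, trunk_test_score   # reject: trunk stays as it was


# Illustrative numbers only, not results from the paper.
fake_test_scores = {"cand-a": 0.58, "cand-b": 0.66}


def evaluate(candidate: dict) -> float:
    return fake_test_scores[candidate["name"]]


trunk = 0.60
merged_a, trunk = merge_gate(trunk, {"name": "cand-a", "dev_score": 0.90}, evaluate)
merged_b, trunk = merge_gate(trunk, {"name": "cand-b", "dev_score": 0.70}, evaluate)
print(merged_a, merged_b, trunk)  # False True 0.66
```

Here the candidate with the better development score is rejected because it fails to beat the trunk on held-out data, which is the overfitting case the gate is meant to catch.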

Arbor generally falls under the concept of "loop engineering," popularized by industry figures like OpenClaw creator Peter Steinberger and Claude Code lead Boris Cherny. The idea is to move beyond single prompts to design iterative cycles (observe, reason, act, verify) that drive autonomous agents. However, as Jin points out, "A loop can fill up with messy, untraceable attempts, and you end up with nothing to show and no way to reconstruct what changed."

Arbor in action

The researchers evaluated Arbor on an autonomous optimization task suite built from real-world research settings and the MLE-Bench Lite machine learning engineering benchmark. The AO suite featured tasks from different areas of AI development, including model training, harness engineering, and data synthesis.

The researchers used different backbone models for the coordinator and executor agents, including Claude Opus 4.6, GPT-5.5, and Gemini-3-Flash. They tested Arbor against the strongest coding agents, Codex and Claude Code. Arbor and the baselines were given the same resources. For the MLE-Bench Lite tasks, Arbor was also compared against top-tier agentic research systems like AI-Scientist, ML-Master, and AIDE.

Arbor consistently outperformed the baselines. It achieved the best held-out test result on all tasks, attaining more than 2.5 times the average relative gain of Codex and Claude Code. On the BrowseComp task, which involves optimizing a search agent, Arbor improved the system's held-out accuracy from a baseline of 45.33% to 67.67%. Meanwhile, Codex and Claude Code stalled at 50% and 53.33%, respectively. On MLE-Bench Lite, when equipped with GPT-5.5, Arbor achieved the strongest result among all benchmarked systems.
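To put the BrowseComp numbers on a common scale, the snippet below computes each system's gain relative to the 45.33% baseline. The definition used, (optimized − baseline) / baseline, is an assumption; the paper may define relative gain differently, and the 2.5 times figure is an average over all tasks rather than this single one.

```python
def relative_gain(baseline: float, optimized: float) -> float:
    """Improvement as a fraction of the starting score (assumed definition)."""
    return (optimized - baseline) / baseline


baseline = 45.33  # held-out accuracy before optimization, from the article
held_out = {"Arbor": 67.67, "Codex": 50.0, "Claude Code": 53.33}

for system, score in held_out.items():
    print(f"{system}: {relative_gain(baseline, score):+.1%}")
```

Under that definition, Arbor's gain on this task is roughly 49%, against roughly 10% for Codex and 18% for Claude Code.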

Arbor proved to be resilient against overfitting. For example, during the Terminal-Bench 2.0 task experiments, Claude Code achieved a high development score of 75, but its score dropped to 71 on the held-out data. Arbor had a lower development score of 72.22 but achieved the highest held-out score of 77.36, an indication that its gains are more likely to transfer to real-world applications.

Arbor also showed generalization in a cross-task transfer experiment. After Arbor finished optimizing the search harness for the BrowseComp task, researchers took the optimized codebase and tested it on two unrelated search-agent tasks, HLE and DeepSearchQA. Arbor's optimized codebase significantly improved performance on those unseen tasks as well.

Deploying Arbor: Sweet spots and hidden costs

For engineering leads looking to drop Arbor into their existing tech stack, the framework is designed to sit on top of existing Git workflows rather than replacing them. "Its output is an ordinary git branch that your existing code review, CI, and human review can inspect directly," Jin said. Only verified gains are merged into a per-run trunk, leaving the main repository untouched until a developer manually chooses to promote the code.

However, deploying Arbor comes with specific tradeoffs. Jin points out that the biggest catch is token cost, as maintaining a long-lived coordinator that continuously manages the tree and dispatches executors is the dominant expense. Running multiple isolated worktrees concurrently also requires genuine compute and disk resources to process real experiments.

So where is Arbor's sweet spot? According to Jin, it excels at tasks with a clear, trustworthy metric, tolerance for a long time horizon, and a real search space with several plausible directions, such as pipeline optimization, data-synthesis quality, and model-training recipe tuning.

Conversely, teams should explicitly avoid using Arbor for real-time latency tasks, obvious one-line fixes, or when the underlying evaluation metric is flawed. The quality ceiling of the entire run is strictly bounded by the quality of the evaluator. "If the metric isn't trustworthy, Arbor will just optimize toward an untrustworthy result faster," Jin said.

Jin sees the next evolution going beyond single scalar metrics. "A natural evolution is to have each node's artifact carry a vector — accuracy, latency, cost — instead of a single score," Jin said. "Going from a single scalar to a multi-objective Pareto search is a very natural extension of the framework."
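A multi-objective search of the kind Jin describes would compare metric vectors by Pareto dominance rather than by a single score. The sketch below shows that comparison only; it is not part of Arbor as described, and the metric names, candidate labels, and values are hypothetical.

```python
def as_maximize(metrics: dict) -> tuple:
    """Flip latency and cost so that larger is better on every axis."""
    return (metrics["accuracy"], -metrics["latency_ms"], -metrics["cost_usd"])


def dominates(a: dict, b: dict) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    va, vb = as_maximize(a), as_maximize(b)
    return (all(x >= y for x, y in zip(va, vb))
            and any(x > y for x, y in zip(va, vb)))


def pareto_front(candidates: dict) -> list:
    """Names of candidates that no other candidate dominates."""
    return [
        name for name, metrics in candidates.items()
        if not any(dominates(other, metrics)
                   for other_name, other in candidates.items()
                   if other_name != name)
    ]


# Hypothetical metric vectors, one per node artifact.
candidates = {
    "A": {"accuracy": 0.70, "latency_ms": 900, "cost_usd": 0.04},
    "B": {"accuracy": 0.66, "latency_ms": 400, "cost_usd": 0.02},
    "C": {"accuracy": 0.65, "latency_ms": 950, "cost_usd": 0.05},
}
print(pareto_front(candidates))  # ['A', 'B']
```

A and B both survive because each wins on a different axis, while C is worse than A on all three metrics and drops out.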
