7 min read · from VentureBeat

How RecursiveMAS speeds up multi-agent inference by 2.4x and reduces token usage by 75%

Our take

RecursiveMAS marks a significant advance in multi-agent AI systems, achieving 2.4x faster inference while cutting token usage by 75%. Traditional text-based communication among agents introduces latency and inflates costs, hindering efficiency. Developed by researchers at the University of Illinois Urbana-Champaign and Stanford University, RecursiveMAS lets agents share information in embedding space rather than through text, improving both speed and accuracy across complex domains while remaining a cost-effective approach for scalable multi-agent solutions.

The recent development of RecursiveMAS, a framework that lets multi-agent AI systems communicate through embedding space rather than traditional text sequences, represents a significant leap forward in AI technology. By addressing the inherent bottlenecks of text-based communication, RecursiveMAS not only enhances inference speed but also drastically reduces token usage, which matters as the costs of token generation in AI models keep rising. The innovation arrives as demand for efficient AI solutions escalates, evidenced by moves like Intercom (now called Fin) launching an AI agent whose only job is managing another AI agent, part of a growing trend of integrating AI into complex workflows.

The challenges faced by multi-agent systems have long been rooted in their reliance on sequential text generation, which introduces latency and complicates training. Traditional methods require agents to wait for one another to finish generating text, an inefficiency that hinders real-time applications. RecursiveMAS shifts this paradigm by letting agents pass continuous latent representations back and forth, effectively enabling them to "think" collaboratively without the delays associated with text. This streamlines communication and opens the door to more sophisticated interactions among agents, paving the way for advances in fields like code generation and medical reasoning. The progress also underscores the urgency of efficiency-focused innovation at a moment when AI's growing energy demands are driving up electricity prices.

Moreover, the cost-effectiveness of RecursiveMAS stands out as a defining feature. By avoiding full model fine-tuning and instead optimizing only the lightweight RecursiveLink components, organizations can deploy multi-agent systems that are both scalable and economical. This is particularly relevant for enterprises that want to adopt AI solutions without incurring prohibitive costs, making advanced AI accessible to a broader range of users. The ability to leverage existing models without significant additional computational resources is a game changer in environments where efficiency and speed are paramount. And as incidents like the recent hotel check-in system breach that exposed a million passports and driver's licenses show, the need for secure and efficient data handling is more critical than ever.

Looking ahead, RecursiveMAS could catalyze a shift in how we think about multi-agent systems. The ability to enhance collaboration among AI agents while maintaining a focus on efficiency and reduced resource consumption will likely influence future developments in the field. As enterprises explore the potential of these advanced systems, the question arises: how will RecursiveMAS and similar frameworks redefine the landscape of AI-driven applications? As we witness the convergence of AI systems into more complex and integrated workflows, the implications for productivity, cost management, and even ethical considerations in AI deployment will be worth monitoring closely. The future of multi-agent systems has never looked more promising, and the groundwork laid by RecursiveMAS could serve as a blueprint for innovation in AI collaboration.

How RecursiveMAS speeds up multi-agent inference by 2.4x and reduces token usage by 75%

One of the key challenges of current multi-agent AI systems is that they communicate by generating and sharing text sequences, which introduces latency, drives up token costs, and makes it difficult to train the entire system as a cohesive unit. 

To overcome this challenge, researchers at the University of Illinois Urbana-Champaign and Stanford University developed RecursiveMAS, a framework that enables agents to collaborate and transmit information through embedding space instead of text. This change results in both efficiency and performance gains. 

Experiments show that RecursiveMAS achieves accuracy improvements across complex domains like code generation, medical reasoning, and search, while also increasing inference speed and slashing token usage. 

RecursiveMAS is significantly cheaper to train than standard full fine-tuning or LoRA methods, making it a scalable and cost-effective blueprint for custom multi-agent systems.

The challenges of improving multi-agent systems

Multi-agent systems can help tackle complex tasks that single-agent systems struggle to handle. When scaling multi-agent systems for real-world applications, a big challenge is enabling the system to evolve, improve, and adapt to different scenarios over time. 

Prompt-based adaptation improves agent interactions by iteratively refining the shared context provided to the agents. By updating the prompts, the system acts as a director, guiding the agents to generate responses that are more aligned with the overarching goal. The fundamental limitation is that the capabilities of the models underlying each agent remain static. 

A more sophisticated approach is to train the agents by updating the weights of the underlying models. However, training an entire system of agents is difficult because updating all the parameters across multiple models is computationally expensive.

Even if an engineering team commits to training their models, the standard method of agents communicating via text-based interactions creates major bottlenecks. Because agents rely on sequential text generation, latency accumulates: each model must wait for the previous one to finish generating its text before it can begin its own processing. 

Forcing models to spell out their intermediate reasoning token-by-token just so the next model can read it is highly inefficient. It severely inflates token usage, drives up compute costs, and makes iterative learning across the whole system painfully slow to scale. 

How RecursiveMAS works

Instead of trying to improve each agent as an isolated, standalone component, RecursiveMAS is designed to co-evolve and scale the entire multi-agent system as a single integrated whole. 

The framework is inspired by recursive language models (RLMs). In a standard language model, data flows linearly through a stack of distinct layers. In contrast, a recursive language model reuses a set of shared layers that processes the data and feeds it back to itself. By looping the computation, the model can deepen its reasoning without adding parameters.
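The looping idea can be illustrated with a toy sketch (NumPy; the single weight matrix, dimensions, and tanh nonlinearity are illustrative stand-ins, not an RLM implementation). The same shared weights are applied repeatedly, so extra recursion rounds deepen the computation without adding parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden width

# One shared "layer": the same weights are reused at every recursion step,
# so effective depth grows while the parameter count stays fixed.
W = rng.standard_normal((d, d)) * 0.1

def recursive_forward(x, rounds):
    h = x
    for _ in range(rounds):
        h = np.tanh(h @ W + x)  # loop the shared layer, residual-style
    return h

x = rng.standard_normal(d)
shallow = recursive_forward(x, rounds=1)
deep = recursive_forward(x, rounds=4)  # deeper computation, same parameters
```

Running more rounds changes the output, but the parameter count stays constant at one d-by-d matrix.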

RecursiveMAS extends this scaling principle from a single model to a multi-agent architecture that acts as a unified recursive system. In this setup, each agent functions like a layer in a recursive language model. Rather than generating text, the agents iteratively pass their continuous latent representations to the next agent in the sequence, creating a looped hidden stream of information flowing through the system. 

This latent hand-off continues down the line through all the agents. When the final agent finishes its processing, its latent outputs are fed directly back to the very first agent, kicking off a new recursion round. 

This structure allows the entire multi-agent system to interact, reflect, and refine its collective reasoning over multiple rounds entirely in the latent space, with only the very last agent producing a textual output in the final round. It is like the agents are communicating telepathically as a unified whole and the last agent provides the final response as text.
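The recursion pattern described above can be sketched in a few lines (NumPy; the "agents" here are stand-in latent transforms of a shared toy width, not real language models):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # toy latent width shared by all agents in this sketch

# Stand-ins for frozen agent backbones: each maps a latent to a latent.
agent_weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]

def agent_step(h, W):
    return np.tanh(h @ W)

def recursive_mas(h0, rounds=3):
    h = h0
    for _ in range(rounds):
        for W in agent_weights:  # latent hand-off down the agent chain
            h = agent_step(h, W)
        # after the last agent, h loops back to the first agent
    return h  # only this final-round latent would be decoded into text

final_latent = recursive_mas(rng.standard_normal(d))
```

No text is produced inside the loop; decoding happens once, from the final latent.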

The architecture of latent collaboration

To make continuous latent space collaboration possible, the authors introduce a specialized architectural component called the RecursiveLink. This is a lightweight, two-layer module designed to transmit and refine a model's latent states rather than forcing it to decode text. 

A language model's last-layer hidden states contain the rich, semantic representation of its reasoning process. The RecursiveLink is designed to preserve and transmit this high-dimensional information from one embedding space to another. 

To avoid the cost of updating every parameter across multiple large language models, the framework keeps the models' parameters frozen. Instead, it optimizes the system by only training the parameters of the RecursiveLink modules.

To handle both internal reasoning and external communication, the system uses two variations of the module. The inner RecursiveLink operates inside an agent during its reasoning phase. It takes the model's newly generated embeddings and maps them directly back into its own input embedding space. This allows the agent to continuously generate a stream of latent thoughts without generating discrete text tokens. 

The outer RecursiveLink serves as the bridge between agents. Because agents in a real-world system might use different model architectures and sizes, their internal embedding spaces have entirely different dimensions. The outer RecursiveLink includes an additional layer designed to match the embeddings from one agent's hidden dimension with the next agent's embedding space.
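A rough sketch of the two module variants (NumPy; the class names, layer shapes, and tanh nonlinearity are assumptions for illustration, not the published architecture):

```python
import numpy as np

rng = np.random.default_rng(2)

def linear(n_in, n_out):
    # illustrative random weights; the real modules are trained
    return rng.standard_normal((n_in, n_out)) * 0.1

class InnerLink:
    """Two-layer inner link: maps an agent's last-layer hidden states
    back into that same agent's input embedding space."""
    def __init__(self, hidden, embed):
        self.W1 = linear(hidden, hidden)
        self.W2 = linear(hidden, embed)

    def __call__(self, h):
        return np.tanh(h @ self.W1) @ self.W2

class OuterLink:
    """Outer link: bridges two agents whose hidden dimensions differ,
    with an extra projection into the next agent's embedding space."""
    def __init__(self, hidden_a, hidden_b, embed_b):
        self.W1 = linear(hidden_a, hidden_a)
        self.proj = linear(hidden_a, hidden_b)  # dimension-matching layer
        self.W2 = linear(hidden_b, embed_b)

    def __call__(self, h):
        return np.tanh(np.tanh(h @ self.W1) @ self.proj) @ self.W2

inner = InnerLink(hidden=32, embed=32)
outer = OuterLink(hidden_a=32, hidden_b=48, embed_b=48)

h = rng.standard_normal(32)  # agent A's last-layer hidden state
back_in = inner(h)           # re-enters agent A's own input space
handed_off = outer(h)        # lands in agent B's embedding space
```

The only difference between the variants is the extra projection, which is what lets heterogeneous models with different hidden sizes sit in the same loop.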

Training proceeds in two stages. First, the inner links are trained independently to warm up each agent's ability to think in continuous latent embeddings. Then the system enters outer-loop training, where the diverse, frozen models are chained together in a loop and the system is evaluated on the final textual output of the last agent. 

Only the RecursiveLink parameters are updated during training; the original model weights remain unchanged, similar to low-rank adaptation (LoRA). Another advantage of this system emerges when multiple agents sit on top of the same backbone model. 

If you have a multi-agent system where two agents are built on the exact same foundation model acting in different roles, you do not need to load two copies of the model into your GPU memory, nor do you train them separately. The agents will share the same backbone as the brain and use the RecursiveLink as the connective tissue.
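The parameter-sharing economics can be sketched as follows (plain Python; the backbone size and per-agent link sizes are hypothetical numbers chosen so the totals line up with the roughly 13 million trainable parameters reported later in the article):

```python
# Frozen, shared backbone: loaded once, never updated during training.
backbone = {"name": "shared-llm", "params": 4_200_000_000, "trainable": False}

# Two role agents reuse the SAME backbone object: one copy in GPU memory.
agents = [
    {"role": "planner", "backbone": backbone, "link_params": 6_500_000},
    {"role": "solver",  "backbone": backbone, "link_params": 6_500_000},
]

assert agents[0]["backbone"] is agents[1]["backbone"]  # shared brain

# Only the lightweight RecursiveLink parameters are optimized.
trainable = sum(a["link_params"] for a in agents)
fraction = trainable / backbone["params"]
print(f"trainable: {trainable:,} ({fraction:.2%} of the frozen backbone)")
```

The trainable fraction stays tiny regardless of how large the shared backbone is, which is where the memory and cost savings come from.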

RecursiveMAS in action

The researchers evaluated RecursiveMAS across nine benchmarks spanning mathematics, science and medicine, code generation, and search-based question answering. They created a multi-agent system using open-weights models including Qwen, Llama-3, Gemma3, and Mistral. These models were assigned roles to form different agent collaboration patterns such as sequential reasoning and mixture-of-experts collaboration. 

RecursiveMAS was compared to baselines under identical training budgets, including standalone models enhanced with LoRA or full supervised fine-tuning, alternative multi-agent frameworks like Mixture-of-Agents and TextGrad, and recursive baselines like LoopLM. It was also compared to Recursive-TextMAS, which uses the same recursive loop structure as RecursiveMAS but forces the agents to explicitly communicate via text.

RecursiveMAS achieved an average accuracy improvement of 8.3% compared to the strongest baselines across the benchmarks. It excelled particularly on reasoning-heavy tasks, outperforming text-based optimization methods like TextGrad by 18.1% on AIME2025 and 13% on AIME2026. 

Because it avoids generating text at every step, RecursiveMAS achieved a 1.2x to 2.4x end-to-end inference speedup. It is also far more token efficient: compared to the text-based Recursive-TextMAS, it reduces token usage by 34.6% in the first recursion round, and by round three it achieves a 75.6% token reduction.

RecursiveMAS also proved remarkably cheap to train. Because it only updates the lightweight RecursiveLink modules, which comprise roughly 13 million parameters, about 0.31% of the trainable parameters of the frozen models, it requires the lowest peak GPU memory and cuts training costs by more than half compared to full fine-tuning.

Enterprise adoption

The efficiency gains — lower token consumption, reduced GPU memory requirements, and faster inference — are intended to make complex multi-step agent workflows viable in production environments without the compute overhead that limits enterprise agentic deployments. The researchers have released the code and trained model weights under the Apache 2.0 license.

