Context compression finally works in production: new research cuts LLM input 16x without the accuracy hit
Our take

Growing context lengths have become a significant bottleneck for large language models (LLMs), one that increasingly affects the practical deployment of sophisticated AI agents. As agents navigate longer conversations, process more documents, and maintain extensive reasoning histories, computational demands skyrocket, often exceeding the capabilities of available infrastructure. Existing solutions, like KV cache compression, frequently involve a tradeoff: they either sacrifice model accuracy or fail to deliver tangible speedups in real-world serving environments. This makes the recent breakthrough announced by a collaborative research team, the development of Latent Context Language Models (LCLMs), particularly noteworthy. Related work points in the same direction: Microsoft's open-source SkillOpt upgrades AI agent skills without touching model weights, a parallel effort to optimize agent performance, and Xiaomi's open-source agentic coding harness MiMo Code, which is reported to beat Claude Code at ultra-long tasks of 200+ steps, showcases the demand for longer context windows in specialized applications. The promise of compressing input context *before* it even reaches the decoder, as LCLMs do, while maintaining accuracy and unlocking substantial speed improvements, represents a potentially transformative shift in how we build and deploy LLM-powered applications.
The elegance of the LCLM architecture lies in how it addresses a fundamental limitation of previous compression techniques. Rather than compressing after the full context has been loaded, LCLMs encode input tokens into shorter, latent representations *before* decoding begins. This directly reduces the computational burden on the decoder, resulting in the reported 8.8x speedup over KV cache baselines at a 16x compression ratio. Crucially, the accuracy cost is modest at 4x compression, less than 3 points on the RULER benchmark, and although accuracy falls further at 16x, the LCLM still outscores every KV cache method tested at that ratio, which underscores the viability of the approach for production environments. The performance on GSM8K math word problems, where the LCLM outperforms other methods even when compressing the full prompt, highlights the model's versatility across use cases. The research team's end-to-end training methodology, which blends continual pre-training, supervised fine-tuning, and an auxiliary reconstruction task, further supports the robustness and generalizability of the LCLM design.
Beyond the technical details, the implications of LCLMs extend to the wider landscape of AI agent development. The ability to process significantly longer contexts at a fraction of the cost unlocks new possibilities for building more capable and nuanced agents. As Micah Goldblum rightly points out, this effectively gives models access to much larger contexts, enabling "multiscale approaches where your model can skim vast amounts of text or code super fast and then only zooms in and fully reads a small portion of the most useful text." This mirrors human cognitive processes, allowing for more efficient information processing and decision-making. The ease of integration – simply swapping out LCLMs for existing LLMs – is another compelling factor, reducing the barrier to adoption for organizations already invested in LLM infrastructure. However, as Goldblum cautions, tuning RAG systems and addressing the challenge of reasoning trace compression remain important considerations for practical implementation.
Ultimately, the development of LCLMs marks a significant step towards overcoming the context window bottleneck that has been hindering the widespread adoption of LLMs in production environments. While challenges remain, particularly around reasoning trace compression and the need for careful RAG pipeline validation, the potential benefits are immense. The ability to handle longer contexts more efficiently and accurately will accelerate the development of more sophisticated AI agents capable of tackling increasingly complex tasks. The question now becomes: how quickly can enterprises integrate this technology and begin to unlock its full potential, and what new applications will emerge as a result of this expanded contextual awareness?
Context windows are becoming a computational bottleneck. The longer an agent runs, the more tokens accumulate from retrieved documents, reasoning traces and conversation history, and the more memory and compute that growing context demands. Most existing solutions either degrade model accuracy, require the full context to load before compression begins, or produce memory savings that don't translate into real speedups in standard serving infrastructure.
A research team from NYU, Columbia, Princeton, University of Maryland, Harvard and Lawrence Livermore National Laboratory published a paper this week that proposes a novel fix. The researchers introduce Latent Context Language Models, or LCLMs, a family of encoder-decoder models that compress input context before it reaches the decoder. The models are open-sourced on Hugging Face.
KV cache compression methods, the dominant approach in the field, still materialize the full KV cache before evicting entries. LCLMs instead compress the input token sequence before decoder prefill, so higher compression ratios directly reduce decoder-side compute and memory. The paper reports that LCLMs at 16x compression produced output 8.8 times faster than KV cache baselines on the RULER long-context benchmark.
"These ballooning contexts take up memory and compute, and they are becoming a computational bottleneck for LLMs," Micah Goldblum, co-lead advisor on the project and a researcher at Columbia University, told VentureBeat. "Our goal was to train language models end-to-end that can handle very long contexts efficiently and accurately. If you can make such a language model, everything becomes cheaper and faster."
What LCLMs can do
LCLMs let models process much longer contexts than would otherwise be practical, at a fraction of the memory and compute cost, without the accuracy degradation that makes most compression methods a poor tradeoff in production.
At 4x compression, the paper reports accuracy of 91.76% on the RULER benchmark, compared to 94.41% with no compression at all. That is a drop of less than 3 points (2.65) for cutting context to a quarter of its original size. At 16x compression, where 93.75% of input tokens are removed, accuracy fell to 75.06%. Every KV cache method tested at the same compression ratio scored lower.
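The percentages quoted here follow from simple arithmetic. A minimal Python sketch that reproduces them is shown below; the function and constant names are illustrative, and the accuracy values are the ones the article attributes to the paper.

```python
def fraction_removed(ratio: float) -> float:
    """Share of input positions eliminated at a given compression ratio."""
    return 1.0 - 1.0 / ratio


def accuracy_drop(baseline: float, compressed: float) -> float:
    """Drop in percentage points relative to the uncompressed baseline."""
    return baseline - compressed


# RULER accuracy figures as quoted in the article.
RULER_UNCOMPRESSED = 94.41
RULER_AT_4X = 91.76
RULER_AT_16X = 75.06

if __name__ == "__main__":
    for ratio, acc in ((4, RULER_AT_4X), (16, RULER_AT_16X)):
        print(
            f"{ratio}x: {fraction_removed(ratio):.2%} of tokens removed, "
            f"{accuracy_drop(RULER_UNCOMPRESSED, acc):.2f} points lost"
        )
```

At 4x this gives 75% of positions removed for a 2.65-point loss; at 16x it gives 93.75% removed for a 19.35-point loss.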
The gains hold on shorter inputs too. On GSM8K math word problems, where the full prompt is compressed rather than just retrieved documents, LCLMs outscored every other method tested regardless of compression ratio.
How it was built
The architecture pairs a 0.6B encoder with a 4B decoder. The encoder compresses blocks of input tokens into shorter sequences of latent embeddings. The decoder processes those in place of the original tokens. Training ran across more than 350 billion tokens.
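To make the effect on the decoder concrete, the sketch below counts how many positions the decoder would process once blocks of tokens are replaced by latent embeddings. The block size and latents-per-block values are hypothetical, chosen only so the ratios come out to 4x and 16x; the article does not state the paper's actual block configuration, and the handling of a trailing partial block is likewise an assumption.

```python
import math


def decoder_length(num_tokens: int, block_size: int, latents_per_block: int) -> int:
    """Latent positions the decoder sees in place of `num_tokens` input tokens.

    Assumed scheme (not taken from the paper): each full block of `block_size`
    tokens becomes `latents_per_block` latent embeddings, and a trailing
    partial block is compressed at the same ratio, rounded up.
    """
    if block_size <= 0 or latents_per_block <= 0:
        raise ValueError("block_size and latents_per_block must be positive")
    if latents_per_block > block_size:
        raise ValueError("latents_per_block must not exceed block_size")
    full_blocks, remainder = divmod(num_tokens, block_size)
    tail = math.ceil(remainder * latents_per_block / block_size)
    return full_blocks * latents_per_block + tail


if __name__ == "__main__":
    tokens = 128_000
    print(decoder_length(tokens, block_size=64, latents_per_block=16))  # 4x
    print(decoder_length(tokens, block_size=64, latents_per_block=4))   # 16x
```

Under these assumed settings, a 128,000-token input shrinks to 32,000 decoder positions at 4x and 8,000 at 16x.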
The training recipe mixes three data types:
Continual pre-training data with compressed and uncompressed spans interleaved throughout
Supervised fine-tuning data covering reasoning and long-context tasks
An auxiliary reconstruction task that pushes the encoder to retain fine-grained detail
The combination addresses a tradeoff that limited earlier compression work, where preserving reconstruction accuracy came at the cost of general task performance.
An architecture search identified the optimal configuration. The paper found that scaling the decoder matters more than scaling the encoder.
Where it fits in an agentic stack
An LCLM is not an abstract research concept. It is designed to work with an existing stack. "You can simply swap out LCLMs for any existing LLM," Goldblum said. "Whenever you retrieve data such as documents and want to dump it into your model's context, simply run those documents through the LCLM's compressor first."
He noted that in the research paper, the researchers demonstrated how to build agents that selectively decompress useful text.
"Think about this like a human skimming content before zooming in on relevant details," Goldblum said.
Goldblum also cautioned that teams integrating the approach into existing agentic pipelines will need to tune their RAG systems accordingly.
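One way to picture the integration described in this section, compressing retrieved documents and keeping full text only for the most useful ones, is the sketch below. It is not the released LCLM API: `build_context`, `toy_compress` and `make_keyword_relevance` are hypothetical names, the compressor is a placeholder that returns a string rather than latent embeddings, and the keyword-overlap score is a stand-in for however an agent would actually decide what to read in full.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ContextItem:
    doc_id: str
    payload: str      # raw text, or a placeholder standing in for latents
    compressed: bool


def build_context(
    docs: Dict[str, str],
    compress: Callable[[str], str],
    relevance: Callable[[str], float],
    zoom_k: int,
) -> List[ContextItem]:
    """Keep the `zoom_k` most relevant documents as full text; compress the rest."""
    ranked = sorted(docs, key=lambda doc_id: relevance(docs[doc_id]), reverse=True)
    zoomed = set(ranked[:zoom_k])
    return [
        ContextItem(doc_id, text, compressed=False)
        if doc_id in zoomed
        else ContextItem(doc_id, compress(text), compressed=True)
        for doc_id, text in docs.items()
    ]


# Toy stand-ins so the sketch runs; neither is part of the released LCLM code.
def toy_compress(text: str) -> str:
    return f"<latents for {len(text.split())} words>"


def make_keyword_relevance(query: str) -> Callable[[str], float]:
    terms = set(query.lower().split())
    return lambda text: float(len(terms & set(text.lower().split())))


if __name__ == "__main__":
    docs = {
        "a": "kv cache eviction",
        "b": "pasta recipes",
        "c": "latent kv cache compression",
    }
    relevance = make_keyword_relevance("kv cache latent")
    for item in build_context(docs, toy_compress, relevance, zoom_k=1):
        print(item)
```

The relevance function and `zoom_k` are the kind of knobs a team would need to validate against its own retrieval quality metrics.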
"We also haven't worked on online compression of reasoning traces," he said. "The naive approach of just occasionally compressing the trace while generating it might work, but that remains to be determined."
What this means for enterprises
Context windows are growing faster than inference infrastructure can keep up, and enterprises are already spending to fix it. VB Pulse Q1 2026 survey data from organizations with 100 or more employees shows hybrid retrieval adoption intent tripling from 10.3% in January to 33.3% in March. Retrieval optimization overtook evaluation as the top investment priority by March, reaching 28.9% of qualified respondents.
Three things stand out for teams evaluating production fit:
Inference cost scales with context length. At 1 million tokens, uncompressed inference with standard KV cache methods runs out of memory on a single H200 GPU. The paper reports LCLMs at 16x compression remain within memory bounds at that context length.
RAG pipeline integration requires tuning. Teams with existing RAG pipelines will need to validate compression behavior against their retrieval quality metrics before deploying at scale.
Reasoning trace compression is unsolved. For agents running long reasoning chains, context growth from the trace is a separate problem from document retrieval. Goldblum acknowledged the gap directly: the naive approach of periodic trace compression might work but has not been tested.
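The memory point in the first item above can be illustrated with a back-of-envelope KV cache estimate. The decoder configuration below (36 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical setup for a model of roughly 4B parameters, not a configuration confirmed by the article or the paper. The estimate counts only the KV cache, ignoring model weights, activations and the encoder's own cost, so treat it as a rough sketch rather than a reproduction of the paper's measurements.

```python
H200_MEMORY_GB = 141  # HBM capacity of a single NVIDIA H200


def kv_cache_gb(
    num_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> float:
    """KV cache size in GB (10^9 bytes): keys and values for every layer and position."""
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token_bytes / 1e9


# Hypothetical decoder shape, assumed for illustration only.
HYPOTHETICAL_DECODER = dict(num_layers=36, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    context = 1_000_000
    full = kv_cache_gb(context, **HYPOTHETICAL_DECODER)
    compressed = kv_cache_gb(context // 16, **HYPOTHETICAL_DECODER)
    print(f"uncompressed: {full:.1f} GB (H200 has {H200_MEMORY_GB} GB)")
    print(f"16x compressed: {compressed:.1f} GB")
```

Under these assumptions the uncompressed cache alone comes to about 147 GB, more than a single H200 holds, while the 16x-compressed sequence needs about 9 GB.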
The models are available at huggingface.co/latent-context and the code at github.com/LeonLixyz/LCLM.
"The biggest things our architectures do is give your model access to much larger contexts, but they also unlock multiscale approaches where your model can skim vast amounts of text or code super fast and then only zooms in and fully reads a small portion of the most useful text," Goldblum said.