Enterprise AI agents keep failing because they forget what they learned
Our take

The recent discussion of the limits of Retrieval-Augmented Generation (RAG) architectures in enterprise AI points to a shift in how decision-making in intelligent systems is approached. RAG frameworks excel at surfacing semantically relevant documents, but they falter when it comes to applying that information. Many organizations run into this gap when they use AI agents to make decisions over large, complex data sets. As noted in the article, enterprises are often hindered by a lack of structured decision context, which can lead to misguided actions based on incomplete or outdated information. The issue is particularly pressing in high-stakes environments like banking, where even a 1% error rate is too high.
Decision context graphs are a direct response to these challenges. By providing structured memory, time-aware reasoning, and explicit decision logic, they let AI agents operate without regression: agents retain validated actions and build on them over time. That non-regressive behavior lets agents learn from experience without losing previously acquired knowledge, which is critical for maintaining performance in dynamic enterprise environments. It also shifts the focus of data management from merely retrieving information to applying it in a contextually relevant way.
The implications of adopting decision context graphs extend beyond mere operational efficiency. They signify a paradigm shift in how we understand AI's role in decision-making processes. The traditional reliance on keyword searches and retrieval systems has proven insufficient for complex decision-making scenarios. Instead, organizations must embrace frameworks that prioritize structured context and temporal relevance. With time as a crucial dimension, agents can discern what rules apply at any given moment, thus reducing the risk of errors that stem from outdated or conflicting information. This structured approach not only enhances the reliability of AI agents but also aligns with a broader trend toward making AI more explainable and predictable—qualities that are essential for gaining trust in enterprise applications.
Looking ahead, the challenge will be ensuring that the automatic ontology generation remains robust against the messy, diverse data that enterprises typically encounter. As AI continues to evolve, the focus must be on refining these technologies to enhance their adaptability and effectiveness in real-world scenarios. The notion of agents learning without regression is an exciting frontier, prompting us to ask—how can we further empower these systems to explore, learn, and adapt in complex environments? The journey toward truly intelligent agents is ongoing, and the advancements in decision context frameworks are a promising step in that direction.
As we observe these developments, it will be crucial to monitor how decision context graphs are implemented across various industries. Will they become the new norm in enterprise AI, or will challenges in data diversity and complexity hinder their widespread adoption? The answers to these questions will shape the future landscape of AI applications in business.
RAG architectures are good at one thing: surfacing semantically relevant documents. That's also where they stop.
A framework called a decision context graph addresses that gap by giving agents structured memory, time-aware reasoning, and explicit decision logic. Rippletide, a startup in the Neo4j ecosystem, has built one. The key capability: agents that are non-regressive, able to freeze validated sequences of actions and compound on them over time.
“The key point you want is non-regressivity: How do you make sure that, when the agent will generate something new, you can compound on the previous discoveries?” said Yann Bilien, Rippletide’s co-founder and chief scientific officer.
Why RAG doesn’t go far enough
Enterprise context is sprawled across ERP tools, logs, databases, vector stores, and policy documents. Generative AI tools can retrieve from all of it — through keyword search, SQL queries, or full RAG pipelines — but retrieval has a ceiling.
Retrieved data may not be relevant to the decision at hand, which can cause hallucinations. And even when agents do pull the right data, they often lack the guidance to make decisions backed by a strong rationale.
That is, RAG retrieves documents, not decision context. “Everyone starts with RAG: Pull relevant docs, stuff them in the prompt, let the model figure it out,” said Wyatt Mayham of Northwest AI Consulting.
While that works fine for chatbots, it “breaks immediately” for agents that need to make decisions and take actions, he pointed out. “The biggest thing builders struggle with is the gap between retrieval and applicability.”
A retrieved document doesn’t tell the agent whether it still applies, whether it’s been superseded, or whether there’s a conflicting rule that takes priority, Mayham said. “Agents need decision context, not just information.”
In construction (the human world), that might mean knowing that a pricing exception expired, that a safety policy only applies in certain jurisdictions, or that a standard operating procedure was updated a month prior. “Miss any of that, and the agent confidently does the wrong thing,” Mayham said.
Without structured decision context, agents combine incompatible rules, invent constraints to fill gaps, and rely on what Bilien calls "probabilistic guesses over unbounded data." Errors are difficult to reproduce because builders can't trace why the agent made a given choice.
The compounding error problem is real, too, Mayham said: A small miss rate per step becomes “catastrophic” across a multi-step workflow. “That’s the main reason most enterprise agents never leave the pilot phase.”
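Mayham's point about compounding can be made concrete with a little arithmetic. The sketch below assumes every step in a workflow succeeds independently and with the same probability, which is a simplification the article does not state; `workflow_success` is a hypothetical helper name.

```python
def workflow_success(per_step_success: float, steps: int) -> float:
    """End-to-end success rate, assuming independent, identical steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Illustrative per-step success rates over a 20-step workflow.
for p in (0.95, 0.99, 0.999):
    print(f"{p:.3f} per step over 20 steps -> {workflow_success(p, 20):.3f} end to end")
```

Under these assumptions, a 5% miss rate per step leaves only about a 36% chance of finishing a 20-step workflow without an error.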
How decision context graphs get to the relevant answer
A decision context graph solves this by encoding a structured map of what is applicable, what the rules are, and when they apply.
The framework is optimized for one question: "Given this situation, which context applies right now?" Time is treated as a first-class dimension; every rule, decision, and exception is scoped to when it is valid.
“The goal is to explicitly address missing, incoherent, or contradictory data when building the graph to avoid probabilistic [errors] once the agent is running,” Bilien said.
The system is built around three principles:
Applicability: Logic is explicitly encoded so the agent knows what rules to remember and apply in a given situation. Context is returned only when it is relevant to the situation.
Time-aware memory: Every rule, decision, and exception is time-scoped. This allows agents to reason about "What was true then versus what is true now," then reproduce or explain their decisions.
Decision paths: The system can explain how it got from A to B and the "why" behind its rationale (for instance, why one piece of context was included and another was not). Agents are given "decision path" examples of how similar cases were handled before.
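The three principles above can be illustrated with a small sketch. This is not Rippletide's implementation: it uses a plain Python list rather than a graph database, and the `Rule` fields, the `applicable_rules` function, and the sample rules are all hypothetical. It shows rules scoped by jurisdiction and validity window, a supersession check, and a decision path recording why each rule was included or excluded.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    rule_id: str
    text: str
    jurisdiction: str                 # where the rule applies
    valid_from: date                  # first day the rule is in force
    valid_to: Optional[date] = None   # last day in force; None means open-ended
    supersedes: Optional[str] = None  # rule_id this rule replaces, if any


def applicable_rules(
    rules: List[Rule], jurisdiction: str, on_date: date
) -> Tuple[List[Rule], List[Tuple[str, str]]]:
    """Return the rules that apply, plus a decision path explaining each choice."""
    path: List[Tuple[str, str]] = []
    candidates: List[Rule] = []
    for rule in rules:
        if rule.jurisdiction != jurisdiction:
            path.append((rule.rule_id, "excluded: different jurisdiction"))
        elif on_date < rule.valid_from:
            path.append((rule.rule_id, "excluded: not yet in force"))
        elif rule.valid_to is not None and on_date > rule.valid_to:
            path.append((rule.rule_id, "excluded: expired"))
        else:
            candidates.append(rule)
    # A rule is dropped only if a rule that is itself in force replaces it.
    superseded = {r.supersedes for r in candidates if r.supersedes}
    selected: List[Rule] = []
    for rule in candidates:
        if rule.rule_id in superseded:
            path.append((rule.rule_id, "excluded: superseded"))
        else:
            path.append((rule.rule_id, "included"))
            selected.append(rule)
    return selected, path


# Hypothetical sample data echoing the examples in the text.
RULES = [
    Rule("price-exc-1", "10% pricing exception", "US", date(2024, 1, 1), date(2024, 3, 31)),
    Rule("sop-v1", "Original operating procedure", "US", date(2023, 1, 1)),
    Rule("sop-v2", "Updated operating procedure", "US", date(2024, 5, 1), supersedes="sop-v1"),
    Rule("safety-eu", "EU-only safety policy", "EU", date(2022, 1, 1)),
]

now, now_path = applicable_rules(RULES, "US", date(2024, 6, 1))
then, _ = applicable_rules(RULES, "US", date(2024, 2, 1))
print("now: ", [r.rule_id for r in now])
print("then:", [r.rule_id for r in then])
for rule_id, reason in now_path:
    print(f"  {rule_id}: {reason}")
```

Querying the same rule set at two dates returns different answers, which is the "what was true then versus what is true now" distinction, and the path gives a builder something to trace when an agent's choice needs explaining.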
At setup, unstructured data is ingested and structured into an ontology: what entities exist, what rules apply, what counts as an exception. Neuro-symbolic AI handles the pattern recognition and encodes formal, machine-readable logic. Over time, the system refines its knowledge base as new decisions are made.
“Neuro-symbolic brings two parts: A neuronal part giving a large autonomy to agents and a symbolic part to reduce the number of data needed and bring control,” Bilien said.
The agent is tested at build time (pre-production) to validate its behaviors or pinpoint improvements. This reduces risk as well as computation needs during inference, he noted.
Agents learning, rather than regressing
When it comes to non-regression, the key piece is compounding both on intelligence (models) and on knowledge (shared between agents), Bilien said. It’s important that agents can explore; when they don’t know how to accomplish a task, they can attempt different possibilities, typically in a controlled environment or simulation (like a support bot trying multiple response patterns).
Then, “once a solution is evaluated as satisfactory, the graph freezes that sequence of actions,” Bilien said. Future exploration then starts from this “stable base of validated behaviors” to prevent newly acquired skills from overwriting previously learned good behavior.
Before an agent acts or affects a customer, it checks against the graph: Is it violating a rule? Hallucinating? Staying within constraints? Can it generalize the solution across similar cases?
At a macro level, the system assesses outcomes: Did the behavior improve long-term performance? Did it generalize across similar contexts? Did it preserve previous capabilities?
“This determinism is key for agents to run reliably at scale,” Bilien said. It leads to behavior that is more consistent, predictable, and explainable, allowing for stronger control and auditability.
“You want your agents to be able to learn by themselves when they face something they don't know,” he said. “You want them to be able to explore and find new solutions.”
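The explore-then-freeze loop described in this section can be sketched in a few lines. This is a simplified illustration, not the vendor's system: `FrozenBehaviors`, `allowed`, `satisfactory`, and the action names are hypothetical, and the evaluation is reduced to a rule check plus a required final action.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


def allowed(sequence: List[str], forbidden: Set[str]) -> bool:
    """Pre-action check: reject any sequence containing a forbidden action."""
    return not any(action in forbidden for action in sequence)


class FrozenBehaviors:
    """Hypothetical store of validated action sequences, keyed by task."""

    def __init__(self) -> None:
        self._frozen: Dict[str, Tuple[str, ...]] = {}

    def explore(
        self,
        task: str,
        candidates: List[List[str]],
        evaluate: Callable[[List[str]], bool],
    ) -> Optional[Tuple[str, ...]]:
        """Explore only when nothing is validated yet; freeze the first success."""
        if task in self._frozen:
            return self._frozen[task]  # validated behavior is never overwritten
        for candidate in candidates:
            if evaluate(candidate):
                self._frozen[task] = tuple(candidate)
                return self._frozen[task]
        return None  # nothing satisfactory found; nothing is frozen


FORBIDDEN = {"issue_refund_without_approval"}


def satisfactory(sequence: List[str]) -> bool:
    # Stand-in for a real evaluation: obey the rules and end by confirming.
    return bool(sequence) and allowed(sequence, FORBIDDEN) and sequence[-1] == "confirm_resolution"


store = FrozenBehaviors()
first = store.explore(
    "refund_request",
    [
        ["issue_refund_without_approval", "confirm_resolution"],  # violates a rule
        ["check_policy", "request_approval", "confirm_resolution"],
    ],
    satisfactory,
)
# A later attempt with a different candidate does not displace the frozen one.
second = store.explore("refund_request", [["skip_checks", "confirm_resolution"]], satisfactory)
print(first)
print(second == first)
```

The design choice worth noting is that the frozen sequence wins even when a later candidate would also pass evaluation, which is what keeps repeated requests predictable.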
Getting beyond "episodic" memory
While the team initially assumed it would deploy RL everywhere, "that actually proved very difficult in an enterprise setting," Bilien said. "Data are scarce for some specific use cases and messy for others."
Typically, using raw data for reliable predictions has been a manual and time-consuming challenge, but “now with agents we entered a new era where building ontologies is possible automatically,” Bilien said.
Classic supervised fine-tuning methods can lead to oscillations, in which models forget the last skill they learned while learning the next one. Overall, learning is not compounded, compression is “dramatic,” and models improve “episodically” rather than continuously, leading them to continually fail on new or unseen tasks.
As Bilien noted: “You will never have a fully self-learning model if you are regressing every time.”
In enterprise use cases — like banking where millions of transactions are processed a day — a high level of reliability is critical, he noted. “One question I ask all customers: Is 95% enough? In a lot of use cases, it's not. You need 99.999%. 1% off is way too much.”
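To put rough numbers on Bilien's question, the sketch below converts a reliability figure into expected errors per day. The volume of one million transactions per day is an assumed figure chosen to match "millions of transactions," and `expected_daily_errors` is a hypothetical helper.

```python
def expected_daily_errors(transactions_per_day: int, reliability: float) -> float:
    """Expected number of failed transactions per day at a given reliability."""
    return transactions_per_day * (1.0 - reliability)


# Assumed volume: one million transactions per day.
for reliability in (0.95, 0.99, 0.99999):
    errors = expected_daily_errors(1_000_000, reliability)
    print(f"{reliability:.3%} reliable -> about {errors:,.0f} errors per day")
```

At that assumed volume, 95% reliability means about 50,000 errors a day, while 99.999% means about 10.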
Decision context graphs can close that gap, he contends: When the same customer support question is asked repeatedly, the agent will return a “satisfactory” answer predictably and without regression, all while retaining autonomy.
Encoding applicability and temporal validity into a structured graph — rather than relying on an LLM to infer it — is a "sound approach" to a real limitation in existing retrieval frameworks, Mayham said. The open question is whether the automatic ontology generation holds up against the messy, diverse data that enterprises actually have. "That's always the hard part," he said.
Related Articles
- Databricks tested a stronger model against its multi-step agent on hybrid queries. The stronger model still lost by 21%.Data teams building AI agents keep running into the same failure mode. Questions that require joining structured data with unstructured content, sales figures alongside customer reviews or citation counts alongside academic papers, break single-turn RAG systems. New research from Databricks puts a number on that failure gap. The company's AI research team tested a multi-step agentic approach against state-of-the-art single-turn RAG baselines across nine enterprise knowledge tasks and reported gains of 20% or more on Stanford's STaRK benchmark suite, along with consistent improvement across Databricks' own KARLBench evaluation framework, according to the research. Databricks argues the performance gap between single-turn RAG and multi-step agents on hybrid data tasks is an architectural problem, not a model quality problem. The work builds on Databricks' earlier instructed retriever research, which showed retrieval improvements on unstructured data using metadata-aware queries. This latest research adds structured data sources, relational tables and SQL warehouses, into the same reasoning loop, addressing the class of questions enterprises most commonly fail to answer with current agent architectures. "RAG works, but it doesn't scale," Michael Bendersky, research director at Databricks, told VentureBeat. "If you want to make your agent even better, and you want to understand why you have declining sales, now you have to help the agent see the tables and look at the sales data. Your RAG pipeline will become incompetent at that task." Single-turn retrieval cannot encode structural constraints The core finding is that standard RAG systems fail when a query mixes a precise structured filter with an open-ended semantic search. 
Consider a question like "Which of our products have had declining sales over the past three months, and what potentially related issues are brought up in customer reviews on various seller sites?" The sales data lives in a warehouse. The review sentiment lives in unstructured documents across seller sites. A single-turn RAG system cannot split that query, route each half to the right data source and combine the results. To confirm this is an architecture problem rather than a model quality problem, Databricks reran published STaRK baselines using a current state-of-the-art foundation model. The stronger model still lost to the multi-step agent by 21% on the academic domain and 38% on the biomedical domain, according to the research. STaRK is a benchmark published by Stanford researchers covering three semi-structured retrieval domains: Amazon product data, the Microsoft Academic Graph and a biomedical knowledge base. How the Supervisor Agent handles what RAG cannot Databricks built the Supervisor Agent as the production implementation of this research approach, and its architecture illustrates why the gains are consistent across task types. The approach includes three core steps: Parallel tool decomposition. Rather than issuing one broad query and hoping the results cover both structured and unstructured needs, the agent fires SQL and vector search calls simultaneously, then analyzes the combined results before deciding what to do next. That parallel step is what allows it to handle queries that cross data type boundaries without requiring the data to be normalized first. Self-correction. When an initial retrieval attempt hits a dead end, the agent detects the failure, reformulates the query and tries a different path. On a STaRK benchmark task that requires finding a paper by an author with exactly 115 prior publications on a specific topic, the agent first queries both SQL and vector search in parallel. 
When the two result sets show no overlap, it adapts and issues a SQL JOIN across both constraints, then calls the vector search system to verify the result before returning the answer. Declarative configuration. The agent is not tuned to any specific dataset or task. Connecting it to a new data source means writing a plain-language description of what that source contains and what kinds of questions it should answer. No custom code is required. "The agent can do things like decomposing the question into a SQL query and a search query out of the box," Bendersky said. "It can combine the results of SQL and RAG, reason about those results, make follow-up queries and then reason about whether the final answer was actually found." It's not just about hybrid retrieval The distinction Databricks draws isn't about retrieval technique, it's about architecture. "We almost don't see it as a hybrid retrieval where you combine embeddings and search results, or embeddings and tables," he said. "We see this more as an agent that has access to multiple tools." The practical consequence of that framing is that adding a new data source means connecting it to the agent and writing a description of what it contains. The agent handles routing and orchestration without additional code. Custom RAG pipelines require data to be converted into a format the retrieval system can read, typically text chunks with embeddings. SQL tables have to be flattened, JSON has to be normalized. Every new data source added to the pipeline means more conversion work. Databricks' research argues that as enterprise data grows to include more source types, that burden makes custom pipelines increasingly impractical compared to an agent that queries each source in its native format. "Just bring the agent to the data," Bendersky said. "You basically give the agent more sources, and it will learn to use them pretty well." 
What this means for enterprises For data engineers evaluating whether to build custom RAG pipelines or adopt a declarative agent framework, the research offers a clear direction: if the task involves questions that span structured and unstructured data, building custom retrieval is the harder path. The research found that across all tested tasks, the only things that differed between deployments were instructions and tool descriptions. The agent handled the rest. The practical limits are real but manageable. The approach works well with five to ten data sources. Adding too many at once, without curating which sources are complementary rather than contradictory, makes the agent slower and less reliable. Bendersky recommends scaling incrementally and verifying results at each step rather than connecting all available data upfront. Data accuracy is a prerequisite. The agent can query across mismatched formats, JSON review feeds alongside SQL sales tables, without requiring normalization. It cannot fix source data that is factually wrong. Adding a plain-language description of each data source at ingestion time helps the agent route queries correctly from the start. The research positions this as an early step in a longer trajectory. As enterprise AI workloads mature, agents will be expected to reason across dozens of source types, including dashboards, code repositories and external data feeds. The research argues the declarative approach is what makes that scaling tractable, because adding a new source stays a configuration problem rather than an engineering one. "This is kind of like a ladder," Bendersky said. "The agent will slowly get more and more information and then slowly improve overall."
- Context architecture is replacing RAG as agentic AI pushes enterprise retrieval to its limitsRedis built its name as the caching layer that kept web applications from collapsing under load. The problem it is targeting now has the same structure but is harder to solve: production AI agents failing not because the models are wrong, but because the data underneath them is scattered, stale and structured for humans rather than machines. Retrieval pipelines built for single queries cannot absorb the volume agents generate. The gap Redis is targeting is structural: agents make orders of magnitude more data requests than human users, but most retrieval layers were built for the human-scale problem. Redis Iris, launched Monday, is the company's answer: a context and memory platform that sits between an agent and the data it needs to act. The platform combines real-time data ingestion, a semantic interface that auto-generates MCP tools from business data models, and an agent memory server built on Redis Flex, a rewritten storage engine that runs 99% of data on flash at a tenth of the cost of in-memory storage alone. The announcement lands as enterprise RAG infrastructure is in active transition. VentureBeat's Q1 2026 VB Pulse RAG Infrastructure Market Tracker found buyer intent to adopt hybrid retrieval tripling from 10.3% to 33.3% between January and March. Retrieval optimization surpassed evaluation as the top enterprise investment priority for the first time. Custom in-house retrieval stacks rose from 24.1% to 35.6% as enterprises outgrew off-the-shelf options. Redis is not the only infrastructure vendor reading those signals — several data platform providers have repositioned around agent context layers in recent weeks. The scale mismatch is the structural argument behind the launch. "Companies will have orders of magnitude more agents than human beings," Rowan Trollope, CEO of Redis, told VentureBeat. 
"Orders of magnitude more agents than human beings means orders of magnitude more load on back end systems." From cache to context Trollope traces the parallel back to the mobile era: When legacy backends built for branch tellers suddenly had to serve a million smartphone users, Redis became the caching layer that absorbed the load without a full rebuild. What is different this time is that agents cannot write their own middleware. In the mobile era, a developer would sit with a database administrator, identify the queries an application needed and hard-code the caching logic into a middleware layer. Agents cannot do that. They need to find the right data at runtime, through interfaces built for them in advance, or they stall. "This is like the analogy of the grocery store in the fridge," he said. "If every time you have to go make your sandwich, you have to run to the grocery store to get the food, that's not very efficient. You put a fridge in every house, you store a little bit of food there. And that's kind of where we still tend to exist in the infrastructure stack." What Redis Iris includes Iris ships five components that together cover data ingestion, semantic access, memory and caching. Redis Data Integration. Now in general availability. RDI uses change data capture pipelines to sync data from relational databases, warehouses and document stores into Redis continuously, with connectors for Oracle, Snowflake, Databricks and Postgres. Context Retriever. Now in preview. Developers define a semantic model of business data using pydantic models and Redis auto-generates MCP tools agents use to query it directly, with row-level access controls enforced server-side. Trollope describes the shift from classic RAG as a directional inversion. "It's just a flip to let the agent pull the data instead of presupposing and stuffing it into the pipeline," he said. Agent Memory. Now in preview. 
Stores short and long-term state across sessions so agents carry context without re-deriving it on each turn. Redis Flex. A rewritten storage engine that runs 99% of data on SSDs and 1% in RAM, delivering petabyte-scale retrieval at sub-millisecond latencies. Redis Search and LangCache. The retrieval and semantic caching backbone underneath the platform. LangCache reduces redundant model calls by caching prompt responses. What analysts say The data industry is generally heading in the same direction now. Every major database vendor is making a context layer argument. Traditional database vendors including Oracle are integrating context and memory layers to bring relational databases into the agentic AI era. Purpose-built vector database vendors including Pinecone are doing the same, building out a new knowledge layer for agentic AI context. Standalone context layers like Hindsight are also part of the emerging landscape. Trollope frames Redis's position as structurally different from that competition. "For us to win, no one else has to lose," he said. Many Redis deployments already run MongoDB or Oracle as the backend system of record. Iris reflects and caches from those systems rather than displacing them. Redis is launching Iris in the Snowflake marketplace with native connectors. Stephanie Walter, Practice Leader for AI Stack at HyperFRAME Research, puts the market context plainly. "The market is converging on the same conclusion: agents don't just need more tokens or better models. They need governed, current, low-latency context," Walter said. Her read on Redis's differentiation focuses on where Redis already sits in the stack, which is close to runtime, latency-sensitive operational state, and real-time data., "The pitch is not 'better RAG' as much as 'agents need live context, memory, and fast retrieval while they are actually working," she said. 
Whether it's Redis or another vendor, every context layer technology will face a governance challenge to be successful. "Agentic AI will not scale in the enterprise if every agent becomes a new cost center, a new data access risk, and a new governance exception," she said. "The winning context layers will be the ones that make agents faster, cheaper, and safer to run." For real-time clinical AI, getting context wrong is not an option Mangoes.ai is one company that has already had to answer those questions in production, under conditions where the cost of getting context wrong is measured in patient outcomes. Amit Lamba, founder and CEO of Mangoes.ai, runs a real-time voice AI platform deployed across large healthcare facilities where patients and clinicians ask live questions about treatment, scheduling and case history. Mangoes.ai built its stack natively on Redis from the start. "Retrieval, memory, and session state all run through Redis, so we're not stitching together separate tools and hoping they talk to each other," Lamba said. The problem Iris's dynamic memory capability addresses is what happens across a complex session. "Think about a one-hour group therapy session," Lamba said. "You need to know who said what, when, and be able to surface the right information to the therapist in the moment. That's not a simple retrieval problem." The platform runs multiple specialized agents in parallel, one for entity identification, one for relationship reasoning and one for integrating case history. "The dynamic memory capability maps almost perfectly to the problem we're solving," Lamba said. What this means for enterprises For enterprises that built their AI stack around RAG, the retrieval layer that got them to production is no longer enough to keep them there The RAG era is giving way to context architecture. The classic RAG model pushed data into the agent before the model was called. 
Production deployments are flipping that: agents pull what they need at runtime through tool calls, treating the data layer as a live resource rather than a pre-loaded payload. Teams still optimizing RAG pipelines are solving last year's problem. The semantic layer is now production infrastructure. The model that defines business entities, their relationships and the access rules between them needs to be built, versioned and maintained with the same discipline as a data pipeline. Most organizations have not staffed or structured for that work. The enterprises that define their context architecture now are the ones that will not have to rebuild it when agent workloads scale. Budget is already moving. VB Pulse Q1 2026 data shows retrieval optimization investment rising from 19% to 28.9% across the quarter, overtaking evaluation spending for the first time. Organizations that spent the previous year measuring their retrieval quality are now spending to fix it. The context layer is an active procurement decision, not a roadmap item. "The first buyer question should not be 'Do I need a vector database, long context, memory, or a context engine?' It should be 'What does this agent need to know, how fresh must that knowledge be, who is allowed to access it, and what does every retrieval cost?'" Walter said.
- How xMemory cuts token costs and context bloat in AI agentsStandard RAG pipelines break when enterprises try to use them for long-term, multi-session LLM agent deployments. This is a critical limitation as demand for persistent AI assistants grows. xMemory, a new technique developed by researchers at King’s College London and The Alan Turing Institute, solves this by organizing conversations into a searchable hierarchy of semantic themes. Experiments show that xMemory improves answer quality and long-range reasoning across various LLMs while cutting inference costs. According to the researchers, it drops token usage from over 9,000 to roughly 4,700 tokens per query compared to existing systems on some tasks. For real-world enterprise applications like personalized AI assistants and multi-session decision support tools, this means organizations can deploy more reliable, context-aware agents capable of maintaining coherent long-term memory without blowing up computational expenses. RAG wasn't built for this In many enterprise LLM applications, a critical expectation is that these systems will maintain coherence and personalization across long, multi-session interactions. To support this long-term reasoning, one common approach is to use standard RAG: store past dialogues and events, retrieve a fixed number of top matches based on embedding similarity, and concatenate them into a context window to generate answers. However, traditional RAG is built for large databases where the retrieved documents are highly diverse. The main challenge is filtering out entirely irrelevant information. An AI agent's memory, by contrast, is a bounded and continuous stream of conversation, meaning the stored data chunks are highly correlated and frequently contain near-duplicates. To understand why simply increasing the context window doesn’t work, consider how standard RAG handles a concept like citrus fruit. 
Imagine a user has had many conversations saying things like “I love oranges,” “I like mandarins,” and separately, other conversations about what counts as a citrus fruit. Traditional RAG may treat all of these as semantically close and keep retrieving similar “citrus-like” snippets. “If retrieval collapses onto whichever cluster is densest in embedding space, the agent may get many highly similar passages about preference, while missing the category facts needed to answer the actual query,” Lin Gui, co-author of the paper, told VentureBeat. A common fix for engineering teams is to apply post-retrieval pruning or compression to filter out the noise. These methods assume that the retrieved passages are highly diverse and that irrelevant noise patterns can be cleanly separated from useful facts. This approach falls short in conversational agent memory because human dialogue is “temporally entangled,” the researchers write. Conversational memory relies heavily on co-references, ellipsis, and strict timeline dependencies. Because of this interconnectedness, traditional pruning tools often accidentally delete important bits of a conversation, leaving the AI without vital context needed to reason accurately. Why the fix most teams reach for makes things worse To overcome these limitations, the researchers propose a shift in how agent memory is built and searched, which they describe as “decoupling to aggregation.” Instead of matching user queries directly against raw, overlapping chat logs, the system organizes the conversation into a hierarchical structure. First it decouples the conversation stream into distinct, standalone semantic components. These individual facts are then aggregated into a higher-level structural hierarchy of themes. When the AI needs to recall information, it searches top-down through the hierarchy, going from themes to semantics and finally to raw snippets. This approach avoids redundancy. 
If two dialogue snippets have similar embeddings, the system is unlikely to retrieve them together if they have been assigned to different semantic components.

For this architecture to succeed, it must balance two vital structural properties. The semantic components must be sufficiently differentiated to prevent the AI from retrieving redundant data. At the same time, the higher-level aggregations must remain semantically faithful to the original context to ensure the model can craft accurate answers.

A four-level hierarchy that shrinks the context window

The researchers developed xMemory, a framework that combines structured memory management with an adaptive, top-down search strategy. xMemory continuously organizes the raw stream of conversation into a structured, four-level hierarchy.

At the base are the raw messages, which are first summarized into contiguous blocks called “episodes.” From these episodes, the system distills reusable facts as semantics that disentangle the core, long-term knowledge from repetitive chat logs. Finally, related semantics are grouped together into high-level themes to make them easily searchable.

xMemory uses a special objective function to constantly optimize how it groups these items. This prevents categories from becoming too bloated, which slows down search, or too fragmented, which weakens the model’s ability to aggregate evidence and answer questions.

When it receives a prompt, xMemory performs a top-down retrieval across this hierarchy. It starts at the theme and semantic levels, selecting a diverse, compact set of relevant facts. This is crucial for real-world applications where user queries often require gathering descriptions across multiple topics or chaining connected facts together for complex, multi-hop reasoning.

Once it has this high-level skeleton of facts, the system controls redundancy through what the researchers call “Uncertainty Gating.”
It only drills down to pull the finer, raw evidence at the episode or message level if that specific detail measurably decreases the model’s uncertainty.

“Semantic similarity is a candidate-generation signal; uncertainty is a decision signal,” Gui said. “Similarity tells you what is nearby. Uncertainty tells you what is actually worth paying for in the prompt budget.” The system stops expanding when it detects that adding more detail no longer helps answer the question.

What are the alternatives?

Existing agent memory systems generally fall into two structural categories: flat designs and structured designs. Both suffer from fundamental limitations.

Flat approaches such as MemGPT log raw dialogue or minimally processed traces. This captures the conversation but accumulates massive redundancy and increases retrieval costs as the history grows longer.

Structured systems such as A-MEM and MemoryOS try to solve this by organizing memories into hierarchies or graphs. However, they still rely on raw or minimally processed text as their primary retrieval unit, often pulling in extensive, bloated contexts. These systems also depend heavily on LLM-generated memory records that must follow strict schema constraints. If the model deviates even slightly from the expected format, memory operations can fail.

xMemory addresses these limitations through its optimized memory construction scheme, hierarchical retrieval, and dynamic restructuring of its memory as it grows larger.

When to use xMemory

For enterprise architects, knowing when to adopt this architecture over standard RAG is critical. According to Gui, “xMemory is most compelling where the system needs to stay coherent across weeks or months of interaction.”

Customer support agents, for instance, benefit greatly from this approach because they must remember stable user preferences, past incidents, and account-specific context without repeatedly pulling up near-duplicate support tickets.
Personalized coaching is another ideal use case, requiring the AI to separate enduring user traits from episodic, day-to-day details.

Conversely, if an enterprise is building an AI to chat with a repository of files, such as policy manuals or technical documentation, “a simpler RAG stack is still the better engineering choice,” Gui said. In those static, document-centric scenarios, the corpus is diverse enough that standard nearest-neighbor retrieval works perfectly well without the operational overhead of hierarchical memory.

The write tax is worth it

xMemory cuts the latency bottleneck associated with the LLM's final answer generation. In standard RAG systems, the LLM is forced to read and process a bloated context window full of redundant dialogue. Because xMemory's precise, top-down retrieval builds a much smaller, highly targeted context window, the reader LLM spends far less compute time analyzing the prompt and generating the final output.

In their experiments on long-context tasks, both open and closed models equipped with xMemory outperformed other baselines, using considerably fewer tokens while increasing task accuracy.

However, this efficient retrieval comes with an upfront cost. For an enterprise deployment, the catch with xMemory is that it trades a massive read tax for an upfront write tax. While it ultimately makes answering user queries faster and cheaper, maintaining its sophisticated architecture requires substantial background processing.

Unlike standard RAG pipelines, which cheaply dump raw text embeddings into a database, xMemory must execute multiple auxiliary LLM calls to detect conversation boundaries, summarize episodes, extract long-term semantic facts, and synthesize overarching themes. Furthermore, xMemory’s restructuring process adds further computational overhead, because the system must curate, link, and update its own internal filing system.
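
The write path described above can be sketched roughly as follows. This is an illustration under stated assumptions, not xMemory's actual code: the `LLM` callable interface, the prompt strings, the per-message boundary check, and the `CountingStubLLM` stand-in are all hypothetical, and a real deployment would call an actual model.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def split_episodes(messages: List[str], llm: LLM) -> List[List[str]]:
    """Group consecutive messages, asking the model where a new episode starts."""
    episodes, current = [], []
    for msg in messages:
        if current and llm(f"BOUNDARY? prev={current[-1]!r} next={msg!r}") == "yes":
            episodes.append(current)
            current = []
        current.append(msg)
    if current:
        episodes.append(current)
    return episodes


def write_path(messages: List[str], llm: LLM) -> Dict[str, List[str]]:
    """Run the four auxiliary steps named in the article, in order."""
    episodes = split_episodes(messages, llm)                              # 1. boundaries
    summaries = [llm("SUMMARIZE: " + " | ".join(ep)) for ep in episodes]  # 2. episodes
    facts = [llm("EXTRACT FACT: " + s) for s in summaries]                # 3. semantics
    theme = llm("NAME THEME: " + "; ".join(facts))                        # 4. themes
    return {"episodes": summaries, "semantics": facts, "themes": [theme]}


class CountingStubLLM:
    """Deterministic stand-in for a real model; counts calls to expose write cost."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        if prompt.startswith("BOUNDARY?"):
            # Toy rule (assumption): a message mentioning "flight" opens a new episode.
            return "yes" if "flight" in prompt.split("next=")[1] else "no"
        return "stub<" + prompt[:24] + ">"


llm = CountingStubLLM()
memory = write_path(
    ["I love oranges", "I like mandarins too",
     "Book my flight to Lisbon", "Window seat, please"],
    llm,
)
print(llm.calls, memory)
```

In this toy run, four messages trigger eight model calls before anything is searchable, whereas a plain embedding store would need none. That gap is the write tax in miniature.
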
To manage this operational complexity in production, teams can execute this heavy restructuring asynchronously or in micro-batches rather than synchronously blocking the user's query.

For developers eager to prototype, the xMemory code is publicly available on GitHub under an MIT license, making it viable for commercial use. If you are trying to implement this in existing orchestration tools like LangChain, Gui advises focusing on the core innovation first: “The most important thing to build first is not a fancier retriever prompt. It is the memory decomposition layer. If you get only one thing right first, make it the indexing and decomposition logic.”

Retrieval isn't the last bottleneck

While xMemory offers a powerful solution to today's context-window limitations, it also exposes the next set of challenges in agentic workflows. As AI agents collaborate over longer horizons, simply finding the right information won't be enough.

“Retrieval is a bottleneck, but once retrieval improves, these systems quickly run into lifecycle management and memory governance as the next bottlenecks,” Gui said. Navigating how data should decay, handling user privacy, and maintaining shared memory across multiple agents is exactly “where I expect a lot of the next wave of work to happen,” he said.
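
As a rough illustration of the earlier advice to run restructuring asynchronously or in micro-batches, the sketch below queues incoming messages and lets a background worker process them in small groups. The function names (`restructure`, `worker`), the batch size, and the `None` shutdown sentinel are assumptions for this example and do not come from the xMemory codebase.

```python
import asyncio


async def restructure(batch, memory):
    """Placeholder for the expensive, LLM-driven restructuring step."""
    await asyncio.sleep(0)       # stands in for auxiliary LLM calls
    memory.append(list(batch))   # record which messages were processed together


async def worker(queue, memory, batch_size):
    """Drain the queue in micro-batches until a None sentinel arrives."""
    batch = []
    while True:
        item = await queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            await restructure(batch, memory)
            batch = []
    if batch:  # flush the final partial batch
        await restructure(batch, memory)


async def main():
    queue, memory = asyncio.Queue(), []
    task = asyncio.create_task(worker(queue, memory, batch_size=2))
    for msg in ["m1", "m2", "m3", "m4", "m5"]:
        queue.put_nowait(msg)    # the request path only enqueues and returns
    queue.put_nowait(None)       # signal shutdown for this demo
    await task
    return memory


batches = asyncio.run(main())
print(batches)
```

The request path never waits on restructuring; it only enqueues. A production version would likely also flush on a timer so that a quiet conversation does not leave a partial batch unprocessed.
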