Databricks tested a stronger model against its multi-step agent on hybrid queries. The stronger model still lost by 21%.
Our take

Data teams building AI agents keep running into the same failure mode. Questions that require joining structured data with unstructured content, sales figures alongside customer reviews or citation counts alongside academic papers, break single-turn RAG systems.
New research from Databricks puts a number on that failure gap. The company's AI research team tested a multi-step agentic approach against state-of-the-art single-turn RAG baselines across nine enterprise knowledge tasks and reported gains of 20% or more on Stanford's STaRK benchmark suite, along with consistent improvement across Databricks' own KARLBench evaluation framework, according to the research. Databricks argues the performance gap between single-turn RAG and multi-step agents on hybrid data tasks is an architectural problem, not a model quality problem.
The work builds on Databricks' earlier instructed retriever research, which showed retrieval improvements on unstructured data using metadata-aware queries. This latest research adds structured data sources, relational tables and SQL warehouses, into the same reasoning loop, addressing the class of questions enterprises most commonly fail to answer with current agent architectures.
"RAG works, but it doesn't scale," Michael Bendersky, research director at Databricks, told VentureBeat. "If you want to make your agent even better, and you want to understand why you have declining sales, now you have to help the agent see the tables and look at the sales data. Your RAG pipeline will become incompetent at that task."
Single-turn retrieval cannot encode structural constraints
The core finding is that standard RAG systems fail when a query mixes a precise structured filter with an open-ended semantic search.
Consider a question like "Which of our products have had declining sales over the past three months, and what potentially related issues are brought up in customer reviews on various seller sites?" The sales data lives in a warehouse. The review sentiment lives in unstructured documents across seller sites. A single-turn RAG system cannot split that query, route each half to the right data source and combine the results.
To confirm this is an architecture problem rather than a model quality problem, Databricks reran published STaRK baselines using a current state-of-the-art foundation model. The stronger model still lost to the multi-step agent by 21% on the academic domain and 38% on the biomedical domain, according to the research.
STaRK is a benchmark published by Stanford researchers covering three semi-structured retrieval domains: Amazon product data, the Microsoft Academic Graph and a biomedical knowledge base.
How the Supervisor Agent handles what RAG cannot
Databricks built the Supervisor Agent as the production implementation of this research approach, and its architecture illustrates why the gains are consistent across task types. The approach includes three core steps:
Parallel tool decomposition. Rather than issuing one broad query and hoping the results cover both structured and unstructured needs, the agent fires SQL and vector search calls simultaneously, then analyzes the combined results before deciding what to do next. That parallel step is what allows it to handle queries that cross data type boundaries without requiring the data to be normalized first.
Self-correction. When an initial retrieval attempt hits a dead end, the agent detects the failure, reformulates the query and tries a different path. On a STaRK benchmark task that requires finding a paper by an author with exactly 115 prior publications on a specific topic, the agent first queries both SQL and vector search in parallel. When the two result sets show no overlap, it adapts and issues a SQL JOIN across both constraints, then calls the vector search system to verify the result before returning the answer.
Declarative configuration. The agent is not tuned to any specific dataset or task. Connecting it to a new data source means writing a plain-language description of what that source contains and what kinds of questions it should answer. No custom code is required.
"The agent can do things like decomposing the question into a SQL query and a search query out of the box," Bendersky said. "It can combine the results of SQL and RAG, reason about those results, make follow-up queries and then reason about whether the final answer was actually found."
It's not just about hybrid retrieval
The distinction Databricks draws isn't about retrieval technique, it's about architecture.
"We almost don't see it as a hybrid retrieval where you combine embeddings and search results, or embeddings and tables," he said. "We see this more as an agent that has access to multiple tools."
The practical consequence of that framing is that adding a new data source means connecting it to the agent and writing a description of what it contains. The agent handles routing and orchestration without additional code.
Custom RAG pipelines require data to be converted into a format the retrieval system can read, typically text chunks with embeddings. SQL tables have to be flattened, JSON has to be normalized. Every new data source added to the pipeline means more conversion work. Databricks' research argues that as enterprise data grows to include more source types, that burden makes custom pipelines increasingly impractical compared to an agent that queries each source in its native format.
"Just bring the agent to the data," Bendersky said. "You basically give the agent more sources, and it will learn to use them pretty well."
What this means for enterprises
For data engineers evaluating whether to build custom RAG pipelines or adopt a declarative agent framework, the research offers a clear direction: if the task involves questions that span structured and unstructured data, building custom retrieval is the harder path. The research found that across all tested tasks, the only things that differed between deployments were instructions and tool descriptions. The agent handled the rest.
The practical limits are real but manageable. The approach works well with five to ten data sources. Adding too many at once, without curating which sources are complementary rather than contradictory, makes the agent slower and less reliable. Bendersky recommends scaling incrementally and verifying results at each step rather than connecting all available data upfront.
Data accuracy is a prerequisite. The agent can query across mismatched formats, JSON review feeds alongside SQL sales tables, without requiring normalization. It cannot fix source data that is factually wrong. Adding a plain-language description of each data source at ingestion time helps the agent route queries correctly from the start.
The research positions this as an early step in a longer trajectory. As enterprise AI workloads mature, agents will be expected to reason across dozens of source types, including dashboards, code repositories and external data feeds. The research argues the declarative approach is what makes that scaling tractable, because adding a new source stays a configuration problem rather than an engineering one.
"This is kind of like a ladder," Bendersky said. "The agent will slowly get more and more information and then slowly improve overall."
Read on the original site
Open the publisher's page for the full experience
Related Articles
- The retrieval rebuild: Why hybrid retrieval intent tripled as enterprise RAG programs hit the scale wallSomething shifted in enterprise RAG in Q1 2026. VB Pulse data spanning January through March tells a consistent story: the market stopped adding retrieval layers and started fixing the ones it already has. Call it the retrieval rebuild. The survey covered three consecutive monthly waves from organizations with 100 or more employees, with between 45 and 58 qualified respondents per month across platform adoption, buyer intent, architecture outlook and evaluation criteria. The data should be treated as directional. Enterprise intent to adopt hybrid retrieval tripled from 10.3% to 33.3% in a single quarter — even as 22% of qualified enterprise respondents reported having no production RAG systems at all. For data engineers and enterprise architects building agentic AI infrastructure, the data reveals a market in active transition: the RAG architecture most enterprises built to scale is not the one they expect to run by year-end. Hybrid retrieval has become the consensus enterprise strategy. Unlike single-method RAG pipelines that rely on vector similarity alone, hybrid retrieval combines dense embeddings with sparse keyword search and reranking layers, trading simplicity for the retrieval accuracy and access control that production agentic workloads require. The standalone vector database category is under pressure. Weaviate, Milvus, Pinecone and Qdrant each lost adoption share across the quarter in the VB Pulse data. Custom stacks and provider-native retrieval are absorbing their displaced share. A growing minority of enterprises are stepping back from RAG altogether — a signal that the market's maturity narrative has meaningful exceptions. Organizations that went wide on RAG in 2025 are hitting the same failure point: the architecture built for document retrieval does not hold at agentic scale. Enterprises that scaled RAG fast are now paying to rebuild it The two largest intent movements in Q1 are directly connected — enterprises confronting retrieval quality problems at scale, and hybrid retrieval emerging as the consensus answer. Investment priorities shifted in parallel. Evaluation and relevance testing led budget intent in January at 32.8% and fell to 15.6% by March. Retrieval optimization moved in the opposite direction, from 19.0% to 28.9% — overtaking evaluation as the top growth investment area for the first time. Steven Dickens, vice president and practice lead at HyperFRAME Research, described the operational burden enterprise data teams are facing in a VentureBeat interview in March on Oracle's agentic AI data stack. "Data teams are exhausted by fragmentation fatigue," Dickens said. "Managing a separate vector store, graph database and relational system just to power one agent is a DevOps nightmare." That fatigue shows directly in the platform data. The custom stack rise to 35.6% is not a rejection of managed retrieval — many organizations run both. It is a consolidation response from engineering teams that have hit the limits of assembling too many components. Not every enterprise has made it that far. The VB Pulse data includes a signal that complicates the market's overall growth narrative: 22.2% of qualified respondents reported no production RAG by March, up from 8.6% in January. The report attributes this cohort to organizations that have "not yet committed to any retrieval infrastructure, or have paused programs" — concentrated in Healthcare, Education and Government, the same sectors showing the highest rates of flat budgets. Standalone vector databases are losing the adoption argument but winning the reliability one Recent reporting by VentureBeat illustrates why the dedicated retrieval layer still matters in production. Two enterprises building on Qdrant show why purpose-built vector infrastructure still wins in production. &AI builds patent litigation infrastructure and runs semantic search across hundreds of millions of documents. Grounding every result in a real source document is not optional — patent attorneys will not act on AI-generated text. That requirement makes the architectural choice clear. "The agent is the interface," Herbie Turner, &AI's founder and CTO, told VentureBeat in March. "The vector database is the ground truth." GlassDollar, a startup that helps Siemens and Mahle evaluate startups, runs an agentic retrieval pattern across a corpus approaching 10 million indexed documents. A single user prompt fans out into multiple parallel queries, each retrieving candidates from a different angle before results are combined and re-ranked. That query volume and precision requirement is what drove the choice of purpose-built vector infrastructure. "We measure success by recall," Kamen Kanev, GlassDollar's head of product, told VentureBeat in March. "If the best companies aren't in the results, nothing else matters. The user loses trust." The VB Pulse data shows that framing — retrieval as ground truth rather than feature — is gaining traction across the broader enterprise market, even as standalone vector database adoption declines. Why enterprises say they need a dedicated vector layer shifted significantly across Q1. In January the top reasons were access control complexity (20.7%) and retrieval precision (19.0%). By March, operational reliability at scale had surged to 31.1% — more than doubling and overtaking everything else. Enterprises are no longer keeping vector infrastructure primarily for precision. They are keeping it because it is the part of the stack they can rely on when query volumes scale. How enterprises are redefining what good retrieval means How enterprises judge their retrieval systems shifted notably across Q1 — and the direction of that shift points to a market getting more sophisticated about what good retrieval actually means. In January, response correctness dominated evaluation criteria at 67.2% — far above anything else. By March, response correctness (53.3%), retrieval accuracy (53.3%) and answer relevance (53.3%) had converged exactly. Getting the right answer is no longer enough if it came from the wrong document or missed the context of the question. Answer relevance was the only criterion that rose across the quarter, gaining five percentage points. It is also the hardest to measure — whether the retrieved context is actually the right context for that specific question requires purpose-built evaluation infrastructure, not just pass-or-fail correctness checks. Its rise signals that a meaningful share of enterprise buyers have moved past basic RAG testing entirely. The market's verdict: RAG isn't dead. The original architecture is The "RAG is dead" narrative had real momentum heading into 2026. It rested on two claims. The first: that long-context windows — models capable of processing hundreds of thousands of tokens in a single prompt — would make dedicated retrieval unnecessary. The second: that agentic memory systems, which store what an agent learns across sessions rather than retrieving it fresh each time, would absorb the knowledge access problem entirely. The VB Pulse data is the enterprise market's answer to the first claim. The long-context-as-dominant-architecture position collapsed from 15.5% in January to 3.5% in February before partially recovering to 6.7% in March. January's sample was heavily weighted toward Technology and Software respondents — the segment most exposed to long-context model announcements in late 2025. As the sample diversified, the position evaporated. On the memory question, Jonathan Frankle, chief AI scientist at Databricks, framed the architecture clearly in a March interview with VentureBeat: a vector database with millions of entries sits at the base of the agentic memory stack, too large to fit in context. The LLM context window sits at the top. Between them, new caching and compression layers are emerging — but none of them replace the retrieval layer at the base. New agentic memory systems like Hindsight, developed by Vectorize, and observational memory approaches like those in the Mastra framework address session continuity and agent context over time — a different problem than high-recall search across millions of changing enterprise documents. The most consequential signal: the share of respondents not expecting large-scale RAG deployments by year-end grew from 3.4% to 15.6% — nearly 5x. That is not a verdict against retrieval. It is a verdict against the retrieval architecture most enterprises built first. The retrieval rebuild is not optional The retrieval rebuild is the cost of scaling RAG without first deciding what architecture could actually support it. If your organization is among the 43.1% that entered Q1 planning to expand RAG into more workflows, the VB Pulse data suggests that plan has already changed for many of your peers — and may need to change for you. Hybrid retrieval is the consensus destination. Custom stack growth to 35.6% reflects teams building retrieval infrastructure around requirements that off-the-shelf products do not fully address. RAG is not dead. The architecture most enterprises used to implement it is. The data suggests the rebuild is not a future decision. For 33% of enterprises, the rebuild is already the stated priority.
- RAG precision tuning can quietly cut retrieval accuracy by 40%, putting agentic pipelines at riskEnterprise teams that fine-tune their RAG embedding models for better precision may be unintentionally degrading the retrieval quality those pipelines depend on, according to new research from Redis. The paper, "Training for Compositional Sensitivity Reduces Dense Retrieval Generalization," tested what happens when teams train embedding models for compositional sensitivity. That is the ability to catch sentences that look nearly identical but mean something different — "the dog bit the man" versus "the man bit the dog," or a negation flip that reverses a statement's meaning entirely. That training consistently broke dense retrieval generalization, how well a model retrieves correctly across broad topics and domains it wasn't specifically trained on. Performance dropped by 8 to 9 percent on smaller models and by 40 percent on a current mid-size embedding model teams are actively using in production. The findings have direct implications for enterprise teams building agentic AI pipelines, where retrieval quality determines what context flows into an agent's reasoning chain. A retrieval error in a single-stage pipeline returns a wrong answer. The same error in an agentic pipeline can trigger a cascade of wrong actions downstream. Srijith Rajamohan, AI Research Leader at Redis and one of the paper's authors, said the finding challenges a widespread assumption about how embedding-based retrieval actually works. "There's this general notion that when you use semantic search or similar semantic similarity, we get correct intent. That's not necessarily true," Rajamohan told VentureBeat. "A close or high semantic similarity does not actually mean an exact intent." The geometry behind the retrieval tradeoff Embedding models work by compressing an entire sentence into a single point in a high-dimensional space, then finding the closest points to a query at retrieval time. That works well for broad topical matching — documents about similar subjects end up near each other. The problem is that two sentences with nearly identical words but opposite meanings also end up near each other, because the model is working from word content rather than structure. That is what the research quantified. When teams fine-tune an embedding model to push structurally different sentences apart — teaching it that a negation flip which reverses a statement's meaning is not the same as the original — the model uses representational space it was previously using for broad topical recall. The two objectives compete for the same vector. The research also found the regression is not uniform across failure types. Negation and spatial flip errors improved measurably with structured training. Binding errors — where a model confuses which modifier applies to which word, such as which party a contract obligation falls on — barely moved. For enterprise teams, that means the precision problem is harder to fix in exactly the cases where getting it wrong has the most consequences. The reason most teams don't catch it is that fine-tuning metrics measure the task being trained for, not what happens to general retrieval across unrelated topics. A model can show strong improvement on near-miss rejection during training while quietly regressing on the broader retrieval job it was hired to do. The regression only surfaces in production. Rajamohan said the instinct most teams reach for — moving to a larger embedding model — does not address the underlying architecture. "You can't scale your way out of this," he said. "It's not a problem you can solve with more dimensions and more parameters." Why the standard alternatives all fall short The natural instinct when retrieval precision fails is to layer on additional approaches. The research tested several of them and found each fails in a different way. Hybrid search. Combining embedding-based retrieval with keyword search is already standard practice for closing precision gaps. But Rajamohan said keyword search cannot catch the failure mode this research identifies, because the problem is not missing words — it is misread structure. "If you have a sentence like 'Rome is closer than Paris' and another that says 'Paris is closer than Rome,' and you do an embedding retrieval followed by a text search, you're not going to be able to tell the difference," he said. "The same words exist in both sentences." MaxSim reranking. Some teams add a second scoring layer that compares individual query words against individual document words rather than relying on the single compressed vector. This approach, known as MaxSim or late interaction and used in systems like ColBERT, did improve relevance benchmark scores in the research. But it completely failed to reject structural near-misses, assigning them near-identity similarity scores. The problem is that relevance and identity are different objectives. MaxSim is optimized for the former and blind to the latter. A team that adds MaxSim and sees benchmark improvement may be solving a different problem than the one they have. Cross-encoders. These work by feeding the query and candidate document into the model simultaneously, letting it compare every word against every word before making a decision. That full comparison is what makes them accurate — and what makes them too expensive to run at production scale. Rajamohan said his team investigated them. They work in the lab and break under real query volumes. Contextual memory. Also sometimes referred to as agentic memory, these systems are increasingly cited as the path beyond RAG, but Rajamohan said moving to that type of architecture does not eliminate the structural retrieval problem. Those systems still depend on retrieval at query time, which means the same failure modes apply. The main difference is looser latency requirements, not a precision fix. The two-stage fix the research validated The common thread across every failed approach is the same: a single scoring mechanism trying to handle both recall and precision at once. The research validated a different architecture: stop trying to do both jobs with one vector, and assign each job to a dedicated stage. Stage one: recall. The first stage works exactly as standard dense retrieval does today — the embedding model compresses documents into vectors and retrieves the closest matches to a query. Nothing changes here. The goal is to cast a wide net and bring back a set of strong candidates quickly. Speed and breadth are what matter at this stage, not perfect precision. Stage two: precision. The second stage is where the fix lives. Rather than scoring candidates with a single similarity number, a small learned Transformer model examines the query and each candidate at the token level — comparing individual words against individual words to detect structural mismatches like negation flips or role reversals. This is the verification step the single-vector approach cannot perform. The results. Under end-to-end training, the Transformer verifier outperformed every other approach the research tested on structural near-miss rejection. It was the only approach that reliably caught the failure modes the single-vector system missed. The tradeoff. Adding a verification stage costs latency. The latency cost depends on how much verification a team runs. For precision-sensitive workloads like legal or accounting applications, full verification at every query is warranted. For general-purpose search, lighter verification may be sufficient. The research grew out of a real production problem. Enterprise customers running semantic caching systems were getting fast but semantically incorrect responses back — the retrieval system was treating similar-sounding queries as identical even when their meaning differed. The two-stage architecture is Redis's proposed fix, with incorporation into its LangCache product on the roadmap but not yet available to customers. What this means for enterprise teams The research does not require enterprise teams to rebuild their retrieval pipelines from scratch. But it does ask them to pressure-test assumptions most teams have never examined — about what their embedding models are actually doing, which metrics are worth trusting and where the real precision gaps live in production. Recognize the tradeoff before tuning around it. Rajamohan said the first practical step is understanding the regression exists. He evaluates any LLM-based retrieval system on three criteria: correctness, completeness and usefulness. Correctness failures cascade directly into the other two, which means a retrieval system that scores well on relevance benchmarks but fails on structural near-misses is producing a false sense of production readiness. RAG is not obsolete — but know what it can't do. Rajamohan pushed back firmly on claims that RAG has been superseded. "That's a massive oversimplification," he said. "RAG is a very simple pipeline that can be productionized by almost anyone with very little lift." The research does not argue against RAG as an architecture. It argues against assuming a single-stage RAG pipeline with a fine-tuned embedding model is production-ready for precision-sensitive workloads. The fix is real but not free. For teams that do need higher precision, Rajamohan said the two-stage architecture is not a prohibitive implementation lift, but adding a verification stage costs latency. "It's a mitigation problem," he said. "Not something we can actually solve."
- Three limitations I keep hitting with retrieval-augmented generation in production and I'm running out of ideas [D]I've had a RAG system running in production for a few months now (legal domain, German regulatory documents). It handles 80% of queries well but there are three patterns where it fails predictably and I haven't found clean solutions. The scatter problem. Some questions need information from 8-10 different documents where each one contributes just a small piece. Vector search finds chunks related to the query but not chunks related to each other. So when someone asks something like "compare how notification deadlines work across different German federal states" the system finds 2-3 state-specific documents that happen to match the query well and misses the rest. The answer looks complete but it's actually partial. Cranking up k adds noise and burns tokens without reliably solving it because the missing documents might use completely different terminology for the same concept. I've thought about query decomposition (break the question into sub-queries per state) but that assumes the system knows upfront how many sub-queries to generate and what dimensions to decompose along. For a general-purpose research tool that feels brittle. The negative knowledge problem. When someone asks "do we have any guidance on employee monitoring" and the answer is genuinely no, the system can't cleanly say that. It retrieves whatever chunks are least irrelevant, and the LLM synthesizes something from them anyway. The user gets a confident-sounding answer about a tangentially related topic instead of a straightforward "this isn't covered in your knowledge base." I've tried similarity score thresholds as a gate but the problem is there's no clean boundary. A legitimate but unusual query might have low similarity scores. A genuinely off-topic query might match some chunks reasonably well because of shared vocabulary. Every threshold I've tested either filters out too much or too little. The prompt instruction to admit uncertainty helps maybe 60% of the time. The other 40% the model just reaches. The timeline problem. Questions like "how did the interpretation of X change after the 2023 ruling" require the system to find pre-ruling documents, find post-ruling documents, understand the temporal relationship, and construct a comparative narrative. The metadata has document dates. The prompt says to respect temporal ordering. But the model struggles to build a coherent before/after story when the retrieved chunks don't explicitly reference each other. It tends to either merge everything into one flat answer or just cite the newer source and ignore the older interpretation. This feels like it needs a fundamentally different retrieval approach (maybe temporal filtering at the search level, or separate retrievals for different time periods) rather than more prompt engineering. I've been reading about graph RAG approaches, agentic retrieval loops, and multi-hop reasoning chains but most of the literature is benchmarks on synthetic datasets, not production implementations. If anyone has actually deployed solutions for any of these three patterns I'd really like to hear what worked and what didn't. Especially interested in approaches that don't require restructuring the entire pipeline. submitted by /u/Fabulous-Pea-5366 [link] [comments]