Tested chunking + embeddings data from 3 production websites. [P]

Our take

The post below tests tiered, page-role-aware retrieval on chunked and embedded content from three production sites: Intercom, HubSpot, and KPMG. Content density differs sharply across them. Intercom's high-scoring chunks come mostly from help-center articles, HubSpot's from concrete case studies, and KPMG's thin, positioning-heavy corpus yields almost none. The post also argues that a per-corpus yield score can flag, before generation, which brands will need more cautious claims.

The tiered, page-role-aware RAG (Retrieval-Augmented Generation) results show how strongly content density and quality shape what retrieval can return. The gap in "yield score" across the three brands (31% for Intercom and 32% for HubSpot, against 8% for KPMG) is a reminder that retrieval is only as good as the source material behind it. The high-scoring chunks from Intercom and HubSpot are mostly practical help-center articles and concrete case studies, respectively, which suggests that specific, operational content is what ends up in the top tier.

For organizations working on their data strategy, the KPMG result is the instructive one. Only 3 of its 209 chunks scored HIGH and 14 scored MEDIUM, which illustrates a common risk: publishing a lot of content that carries little substance. Volume alone is not enough; quality and relevance decide what a retrieval system can actually use. As we explore tools like [Spice: We built an open-sourced decision layer that sits above your AI agents (controls agent actions before execution) [P]](/post/spice-we-built-an-open-sourced-decision-layer-that-sits-abov-cmphxwsfu0d8fs0gl2stagnw7), the same point applies: curating meaningful content comes before optimizing retrieval.

The analysis also covers tier weighting. Multiplying a chunk's similarity score by a tier-dependent weight (the post uses 1.20 for HIGH) changes which chunks land in the top results, so a substantive chunk can outrank a slightly closer but thinner one. That helps surface useful material even in a weak corpus. Separately, the yield ratios (31% for Intercom, 32% for HubSpot, 8% for KPMG) give organizations a simple number for assessing content quality and adjusting their strategy.

This raises a practical question: are we measuring the quality of our content and its effect on retrieval at all? Standard RAG benchmarks tend to assume uniformly substantive source material, which does not match real corpora. Evaluations that account for content density would give a more accurate picture of retrieval outcomes, and that matters more as AI systems increasingly sit between people and the information they use to make decisions.

Looking ahead, the challenge is to prioritize content quality over quantity. As AI systems become part of daily workflows, demand for accessible, actionable answers will grow. The open question is how organizations will balance producing substantial content against the precision and relevance their retrieval systems need.

Tiered + page-role-aware RAG retrieval results across 3 corpora with very different content density:

Workspace	Sources	Chunks	HIGH	MEDIUM	LOW	REJECTED
Intercom	188	941	96	200	541	104
HubSpot	251	1705	40	508	1153	4
KPMG	53	209	3	14	127	65

(HIGH = avg operational score 0.84, MEDIUM = 0.55-0.65, LOW = 0, REJECTED = nav/legal/careers)

87 of Intercom's 96 HIGH chunks are help-center articles. HubSpot's HIGH chunks are concrete case studies ("23% increase in ACV"). KPMG's HIGH tier is basically empty (3 chunks) because the entire corpus is positioning prose.

Retrieval probes on KPMG (the worst-case corpus):

"Family business succession" → /private-enterprise.html (cosine 0.721)
"ESG and climate risk" → /our-insights/esg.html (cosine 0.794)
"Cybersecurity for energy sector" → /energy-natural-resources-chemicals.html (cosine 0.656)
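
For readers less familiar with the numbers above: the cosine figure is the normalized dot product between the query embedding and a chunk embedding. A minimal sketch of a top-1 probe follows. The three-dimensional vectors and the page labels are made up for illustration; they are not the embeddings or the model used in these tests.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_page(query_vec, page_vecs):
    """Return (page, score) for the page whose vector is closest to the query."""
    scored = ((page, cosine(query_vec, vec)) for page, vec in page_vecs.items())
    return max(scored, key=lambda item: item[1])

# Toy vectors (illustrative only); real embeddings have hundreds of dimensions.
pages = {
    "page-a": [0.9, 0.1, 0.2],
    "page-b": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]

page, score = best_page(query, pages)
print(page, round(score, 3))
```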

So semantic relevance routes correctly even on a thin corpus. Tier weighting (HIGH × 1.20) shifts the top-k composition meaningfully: on Q2 (the ESG probe), a 0.535-cosine HIGH chunk gets reranked above 0.6+ LOW chunks (weighted 0.642 vs 0.51-0.59).
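
A rough sketch of that reranking step, for anyone who wants to try it. Only the HIGH multiplier (1.20) is stated in this post. The LOW multiplier of 0.85 is an assumption back-derived from the weighted range quoted above (0.60 × 0.85 = 0.51), the MEDIUM multiplier of 1.00 is a placeholder, and the chunk ids and the 0.69 cosine are hypothetical.

```python
from dataclasses import dataclass

# HIGH is from the post; MEDIUM and LOW are assumed values for illustration.
TIER_WEIGHTS = {"HIGH": 1.20, "MEDIUM": 1.00, "LOW": 0.85}

@dataclass(frozen=True)
class Hit:
    chunk_id: str   # hypothetical identifier
    tier: str       # "HIGH", "MEDIUM", or "LOW"
    cosine: float   # raw similarity from the vector search

def weighted_score(hit):
    """Raw cosine multiplied by the weight of the chunk's tier."""
    return hit.cosine * TIER_WEIGHTS[hit.tier]

def rerank(hits, k=3):
    """Sort by tier-weighted score, highest first, and keep the top k."""
    return sorted(hits, key=weighted_score, reverse=True)[:k]

hits = [
    Hit("low-a", "LOW", 0.69),
    Hit("low-b", "LOW", 0.60),
    Hit("high-a", "HIGH", 0.535),
]

for hit in rerank(hits):
    print(hit.chunk_id, hit.tier, round(weighted_score(hit), 3))
```

With these assumed weights the HIGH chunk moves from last place on raw cosine to first place after weighting, which is the effect described above.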

Key takeaway: a "yield score" (HIGH+MEDIUM chunks / total chunks) is itself useful telemetry. For Intercom that ratio is 31%. For HubSpot it's 32%. For KPMG it's 8%. That predicts before generation which brands will need softer claims and more swap-resistant phrasing.
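
The ratio is simple to reproduce from the table. A minimal sketch using the tier counts listed above; the function and variable names are mine, and the denominator includes REJECTED chunks, since that is what reproduces the quoted percentages.

```python
# Tier counts copied from the table in this post.
COUNTS = {
    "Intercom": {"HIGH": 96, "MEDIUM": 200, "LOW": 541, "REJECTED": 104},
    "HubSpot": {"HIGH": 40, "MEDIUM": 508, "LOW": 1153, "REJECTED": 4},
    "KPMG": {"HIGH": 3, "MEDIUM": 14, "LOW": 127, "REJECTED": 65},
}

def yield_score(tiers):
    """(HIGH + MEDIUM) / total chunks, counting REJECTED in the total."""
    total = sum(tiers.values())
    return (tiers["HIGH"] + tiers["MEDIUM"]) / total if total else 0.0

for workspace, tiers in COUNTS.items():
    print(f"{workspace}: {yield_score(tiers):.0%}")
```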

Anyone publishing benchmarks on this kind of corpus-quality awareness? Most RAG benchmarks assume the source material is uniformly substantive, which is wildly untrue in the wild.

submitted by /u/Otherwise_Economy576