Merck and Mastercard are seeing real agentic AI results. Both say the plumbing came first.
Our take

The recent agentic AI results at Merck and Mastercard highlight a crucial lesson for organizations adopting AI: robust infrastructure is foundational to innovation. Merck's experience, as described by VP of Digital Platforms Sean Finnerty, shows that its improvements in drug discovery and marketing material compliance stem not just from the AI itself but from deliberately built digital "plumbing" that supports it. This mirrors a broader industry shift toward treating foundational infrastructure as a prerequisite, a theme that also runs through discussions of AI in sectors like finance and healthcare. For instance, the article “Google just broke SEO. Here’s what replaces it.” underscores the need for reliable frameworks as we transition into an AI-centric world.
This plumbing-first approach guards against the pitfalls of disjointed solutions, which lead to what Finnerty terms “debt”: thousands of one-off builds that have to be dealt with later and that drag on further innovation. By investing in a cohesive infrastructure, companies can more easily integrate and operate many AI agents across multiple workflows. As AI permeates enterprise operations, the lessons from Merck’s experience are likely to inform other organizations looking to harness AI.
Moreover, Finnerty's account of AI in drug discovery and marketing shows how these technologies can shorten timelines and raise productivity. For example, generating marketing drafts that are "99% right" on compliance is a significant gain for an industry known for its regulatory complexity. It accelerates delivery and frees people to focus on strategic oversight rather than the minutiae of compliance checks. Financial services organizations face a similar tension between trust and efficiency, as detailed in the section “Use cases for agentic AI in financial services.”
Looking ahead, the journey toward fully leveraging AI will undoubtedly come with challenges. The “wackiness” Finnerty encountered, where AI generated nonsensical scenarios, serves as a reminder that while AI has advanced, it is not infallible. The need for guardrails and supervision in AI operations emphasizes the importance of a thoughtful approach to AI deployment, one that combines human oversight with automated processes. This balance will be vital as organizations look to scale their AI initiatives while maintaining control over the outcomes.
As the industry progresses, one question remains: how will organizations ensure the integrity of their AI systems while navigating the complexities of implementation? The path to successful AI integration is multifaceted, involving not just the technological capabilities but also a commitment to continual learning and adaptation. The experiences of Merck and Mastercard offer valuable lessons for others in the field, illustrating the need for a robust infrastructure that supports innovative solutions while instilling confidence in their use. The future of AI in enterprise settings will depend on how effectively organizations can learn from these early adopters and build frameworks that empower their data-driven aspirations.
Merck is using AI agents to cut one drug discovery cycle by a third and ship compliant marketing materials up to 80% faster, but VP of Digital Platforms Sean Finnerty says it is working only because the company built the infrastructure first.
And the pharmaceutical manufacturer is seeing promising early results: AI is generating marketing drafts that are “99% right” when it comes to compliance, shrinking review cycles from months to days and accelerating delivery by 70% to 80%. In the company’s medical research, meanwhile, one AI-assisted discovery cycle was reduced by 33%.
Still, agentic AI only works if companies first build the underlying “plumbing” of digital platforms and services, Finnerty said at a recent AI Impact Series event.
“If we do one-offs, we're gonna end up with thousands and thousands of things that are ultimately just gonna be debt that we'll have to deal with later,” he said. “And that's gonna be a drag on any further innovation.”
Starting with the plumbing
Merck’s plumbing-first strategy comes from lessons learned during the early days of cloud in the 2010s “when nobody knew what the heck was going on,” Finnerty said.
Getting the cloud right meant building from the ground up; at Merck, that infrastructure now supports 2,500 AWS accounts, numerous Microsoft Azure subscriptions, and new Google Cloud Platform (GCP) integrations.
“AI is gonna be the same exact thing,” Finnerty said. “We're going to have thousands and thousands of agents.” The questions then pile up: How do you register them? How do you secure them? How do you ensure they're connected to the right tools, and have access to the right data and the right context?
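The registration, security, and access questions above can be made concrete with a small sketch. Everything in it is an assumption for illustration: the `AgentRecord` and `AgentRegistry` names, the fields, and the deny-by-default rule are hypothetical and do not describe Merck's actual platform.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentRecord:
    """Hypothetical registry entry; field names are illustrative only."""
    agent_id: str
    owner: str
    allowed_tools: frozenset = field(default_factory=frozenset)
    allowed_datasets: frozenset = field(default_factory=frozenset)


class AgentRegistry:
    """In-memory registry that denies anything not explicitly granted."""

    def __init__(self):
        self._agents = {}

    def register(self, record: AgentRecord) -> None:
        # Refuse duplicate IDs so every agent has exactly one owner on record.
        if record.agent_id in self._agents:
            raise ValueError(f"agent {record.agent_id!r} is already registered")
        self._agents[record.agent_id] = record

    def can_use_tool(self, agent_id: str, tool: str) -> bool:
        record = self._agents.get(agent_id)
        return record is not None and tool in record.allowed_tools

    def can_read(self, agent_id: str, dataset: str) -> bool:
        record = self._agents.get(agent_id)
        return record is not None and dataset in record.allowed_datasets


# Example with made-up agent, tool, and dataset names.
registry = AgentRegistry()
registry.register(AgentRecord(
    agent_id="marketing-review-01",
    owner="digital-platforms",
    allowed_tools=frozenset({"document_search"}),
    allowed_datasets=frozenset({"approved_claims"}),
))
```

Unregistered agents and ungranted tools are both refused, which is one simple way to keep thousands of agents from accumulating untracked access.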
Context delivery is also critical; Merck works with three hyperscalers and has forty-seven edge locations and hundreds of databases. “Many, many petabytes” of structured and unstructured data are stored in Oracle databases, SQL databases, Excel spreadsheets, phone transcripts, and other repositories, Finnerty said.
His team is building scaffolding to deliver meaningful context in various situations, he explained. Data must be organized and ingested into various platforms, because “there’s no one solution to solve every single problem.” Sometimes it's Databricks, other times it's Amazon Redshift, “plus four other things.”
The goal is: “Let's make that easy and frictionless for people to do, and secure it, and make sure it's well integrated with MCP [model context protocol], and A2A [Agent2Agent], and upstream compute,” Finnerty said. “If you wanna run stuff on GCP or you wanna run stuff on AWS, we've got the plumbing in place so you can run your adjacent workloads wherever you want.”
How Merck is using agents
As it builds out its technical plumbing, Merck is experimenting with agents across regulated enterprise operations, scientific discovery workflows, and app modernization.
Notably, AI is accelerating drug discovery. Finnerty explained that scientists look at molecular structures and disease states to determine if a given condition is druggable. But even if a disease state is known, developing a drug to target it can take years.
Now with AI, teams are starting to see “very promising things,” such as cutting one particular research cycle down by one-third. “That's a year off of the life of the discovery cycle,” Finnerty said. “Which means, theoretically, we can get it to a patient who needs that therapy a year faster.”
Once developed and approved, these products are regulated and marketing materials around them must be clearly and explicitly articulated. “The way you communicate that information per market, per country, per state, per region, is all very carefully governed and regulated,” Finnerty said. It’s also variable: An ad campaign for a vaccine in the state of Georgia looks much different from one launched in Canada.
Historically, humans did the due diligence to make sure the company complied with various laws. Draft materials go through iterations of reviews; when a mistake is discovered, it gets “kicked back to the beginning, and it goes through it again, and then it takes another however many weeks and months,” Finnerty said.
But now, AI can do that “much, much more effectively,” and the process is evolving from human-in-the-loop to essentially “human-as-governor.” With human oversight, AI can deliver a first draft in a day or a week that is 99% there, allowing teams to ship materials up to 80% faster.
Meanwhile, in app modernization, AI can discover architecture; document data interactions, APIs, and network paths; and perform authentication and authorization checks. It can also write Terraform code for deployment and refactor JavaScript into Python.
Where the company would have previously spent weeks and months and hundreds of thousands of dollars to update one application, Finnerty said, agents are now handling the work through prompts.
Running into "wackiness"
That’s not to say there aren’t significant challenges. Finnerty noted that his team has run into some “wackiness,” for example in automated code and scenario testing. AI has blatantly made up scenarios, whether due to incorrect context, infrastructure, “or if it was just getting creative with, ‘You should be testing these three functions that don't even exist in the code that you're trying to test.’”
“That surprised me a little bit because I thought we were further past some of the hallucination challenges in these later models,” he said.
To address this, his team has engineered guardrails to keep hallucinations to a minimum, essentially using AI to supervise AI and applying confidence scores. So if Claude created the first output, they’ll instruct Microsoft Copilot to assess it.
“So if you ask something once, have AI check it, then ask it a third time, the confidence increases every time, and it minimizes some of the garbage that gets created in the early runs,” Finnerty said.
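A minimal sketch of the review-and-score idea, under stated assumptions: the reviewers here are plain Python functions standing in for a second model, and "confidence" is simply the fraction of reviewers that accept a proposal. Neither is described in the source. The deterministic reviewer targets the specific failure Finnerty mentioned, tests proposed for functions that do not exist in the code.

```python
import ast
from typing import Callable, Iterable

# A reviewer takes the source code and the proposed test targets
# and returns True if it accepts the proposal.
Reviewer = Callable[[str, list], bool]


def defined_functions(source: str) -> set:
    """Collect the names of functions actually defined in the source."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def targets_exist(source: str, proposed_targets: list) -> bool:
    """Deterministic reviewer: every proposed test target must exist."""
    return set(proposed_targets) <= defined_functions(source)


def confidence(source: str, proposed_targets: list,
               reviewers: Iterable[Reviewer]) -> float:
    """Fraction of reviewers that accept the proposal, from 0.0 to 1.0."""
    reviewers = list(reviewers)
    if not reviewers:
        return 0.0
    votes = [review(source, proposed_targets) for review in reviewers]
    return sum(votes) / len(votes)


CODE = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"


def lenient_reviewer(source, targets):
    """Stand-in for a second model that accepts everything."""
    return True


score_real = confidence(CODE, ["add", "sub"], [targets_exist, lenient_reviewer])
score_made_up = confidence(CODE, ["add", "mul"], [targets_exist, lenient_reviewer])
```

The proposal naming the nonexistent `mul` scores lower than the one naming only real functions, so a threshold on the score could send it back for another pass.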
Use cases for agentic AI in financial services
Meanwhile, at Mastercard, Chief Data Officer Andrew Reiskind and his team are focusing agentic experimentation on highly orchestrated transaction and dispute workflows. As he noted, a chargeback or fraud dispute is not a single event.
When a consumer disputes a charge (typically online), that “kicks off an entire other process on the back-end that tends to be very labor-intensive,” Reiskind said.
Mastercard has to collect specifics about the actual dispute; then the merchant has its own investigations (Was the card reported as lost or stolen? Does the consumer dispute charges often?). Further, the network sitting in the middle has its own rules for timing and information submission.
“You have each and every one of these steps, many of which are unstructured, but there are also structured data elements to this,” Reiskind said. Whether a card was lost or stolen tends to be structured, but the consumer complaint is “unstructured data of questionable reliability.”
“So you're sitting there with a decisioning system that has deterministic decisions, but also probabilistic decisions,” he said.
AI agents can speed up this process and potentially resolve disputes, but deploying them raises complex questions: Which tasks are you handing off to agents? When are they kicking things back to human reps? How many agents are you ultimately using? What are the cost implications?
Then there are reputational questions and costs: Have you just called a consumer potentially a liar when they weren't lying?
“It's an exact problem where you want to, as a bank, maintain trust with your consumer,” Reiskind said. “But you also wanna make this efficient and take costs out of the system.”
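The mix of deterministic and probabilistic decisions Reiskind describes could be sketched as a simple router. This is an illustration only: the `Dispute` fields, the 0.9 threshold, and the rule that a card already reported stolen is approved automatically are all assumptions, not Mastercard's rules.

```python
from dataclasses import dataclass


@dataclass
class Dispute:
    card_reported_stolen: bool     # structured, deterministic signal
    complaint_credibility: float   # hypothetical model score in [0, 1]


def route_dispute(dispute: Dispute, auto_threshold: float = 0.9) -> str:
    """Return 'auto_approve' or 'human_review'; this sketch never auto-denies."""
    if not 0.0 <= dispute.complaint_credibility <= 1.0:
        raise ValueError("complaint_credibility must be between 0 and 1")
    # Deterministic rule (assumed): a card already reported stolen is approved.
    if dispute.card_reported_stolen:
        return "auto_approve"
    # Probabilistic rule: only high-confidence scores skip the human rep.
    if dispute.complaint_credibility >= auto_threshold:
        return "auto_approve"
    return "human_review"
```

Leaving denial to a human is a deliberate choice in this sketch: it avoids having an agent effectively call a consumer a liar on the strength of an unreliable score.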
The PB&J versus turkey mistake: Determine what risks are acceptable
There’s always going to be risk with AI, and enterprises should assess it from the beginning of product design, Reiskind said. There’s also the question of acceptable risk.
As an example: Did you serve a customer a peanut butter and jelly sandwich instead of a turkey sandwich (a minor inconvenience)? Or did you serve gluten to someone with celiac disease?
“Is it an acceptable risk if one percent of the time it makes the mistake? If it is, let's go to the next stage of how you're mitigating that risk,” Reiskind said.
Leaders must perform cost-benefit analysis, break problems down to their “constituent pieces,” and calculate cost for each one. But these are estimates; it’s near-impossible to forecast real usage, Reiskind said. “It is not a simple process to get to the cost,” he said. “But it is doable.”
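One way to picture breaking a problem into constituent pieces and costing each one is an expected-cost estimate per step. The sketch below is an assumption, not Reiskind's method: the `Step` fields and all figures are made up, and the one percent error rate simply echoes his example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    tasks_per_month: int
    agent_cost_per_task: float   # estimated usage cost per task
    error_rate: float            # fraction of tasks the agent gets wrong
    cost_per_error: float        # rework, refunds, or reputational cost


def expected_monthly_cost(step: Step) -> float:
    """Running cost plus the expected cost of mistakes for one step."""
    per_task = step.agent_cost_per_task + step.error_rate * step.cost_per_error
    return step.tasks_per_month * per_task


def total_expected_cost(steps) -> float:
    """Sum the estimates across every constituent step."""
    return sum(expected_monthly_cost(step) for step in steps)


# Illustrative numbers only.
steps = [
    Step("collect dispute details", 1000, 0.05, 0.01, 20.0),
    Step("draft consumer response", 1000, 0.10, 0.01, 200.0),
]
intake_cost = expected_monthly_cost(steps[0])
total_cost = total_expected_cost(steps)
```

With these inputs, the second step dominates the total even though both steps share the same error rate, because each of its mistakes is assumed to cost ten times more. That is the difference between the wrong sandwich and the gluten case.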
Related Articles
- The three disciplines separating AI agent demos from real-world deployment
Getting AI agents to perform reliably in production, not just in demos, is turning out to be harder than enterprises anticipated. Fragmented data, unclear workflows, and runaway escalation rates are slowing deployments across industries. “The technology itself often works well in demonstrations,” said Sanchit Vir Gogia, chief analyst with Greyhound Research. “The challenge begins when it is asked to operate inside the complexity of a real organization.” Burley Kawasaki, who oversees agent deployment at Creatio, and his team have developed a methodology built around three disciplines: data virtualization to work around data lake delays; agent dashboards and KPIs as a management layer; and tightly bounded use-case loops to drive toward high autonomy. In simpler use cases, Kawasaki says these practices have enabled agents to handle up to 80-90% of tasks on their own. With further tuning, he estimates they could support autonomous resolution in at least half of use cases, even in more complex deployments. “People have been experimenting a lot with proof of concepts, they've been putting a lot of tests out there,” Kawasaki told VentureBeat. “But now in 2026, we’re starting to focus on mission-critical workflows that drive either operational efficiencies or additional revenue.”
Why agents keep failing in production
Enterprises are eager to adopt agentic AI in some form or another (often because they're afraid to be left out, even before they identify real-world tangible use cases) but run into significant bottlenecks around data architecture, integration, monitoring, security, and workflow design. The first obstacle almost always has to do with data, Gogia said. Enterprise information rarely exists in a neat or unified form; it is spread across SaaS platforms, apps, internal databases, and other data stores. Some are structured, some are not.
But even when enterprises overcome the data retrieval problem, integration is a big challenge. Agents rely on APIs and automation hooks to interact with applications, but many enterprise systems were designed long before this kind of autonomous interaction was a reality, Gogia pointed out. This can result in incomplete or inconsistent APIs, and systems can respond unpredictably when accessed programmatically. Organizations also run into snags when they attempt to automate processes that were never formally defined, Gogia said. “Many business workflows depend on tacit knowledge,” he said. That is, employees know how to resolve exceptions they’ve seen before without explicit instructions — but, those missing rules and instructions become startlingly obvious when workflows are translated into automation logic. The tuning loop Creatio deploys agents in a “bounded scope with clear guardrails,” followed by an “explicit” tuning and validation phase, Kawasaki explained. Teams review initial outcomes, adjust as needed, then re-test until they’ve reached an acceptable level of accuracy. That loop typically follows this pattern: Design-time tuning (before go-live): Performance is improved through prompt engineering, context wrapping, role definitions, workflow design, and grounding in data and documents. Human-in-the-loop correction (during execution): Devs approve, edit, or resolve exceptions. In instances where humans have to intervene the most (escalation or approval), users establish stronger rules, provide more context, and update workflow steps; or, they’ll narrow tool access. Ongoing optimization (after go-live): Devs continue to monitor exception rates and outcomes, then tune repeatedly as needed, helping to improve accuracy and autonomy over time. Kawasaki’s team applies retrieval-augmented generation to ground agents in enterprise knowledge bases, CRM data, and other proprietary sources. 
Once agents are deployed in the wild, they are monitored with a dashboard providing performance analytics, conversion insights, and auditability. Essentially, agents are treated like digital workers. They have their own management layer with dashboards and KPIs. For instance, an onboarding agent will be incorporated as a standard dashboard interface providing agent monitoring and telemetry. This is part of the platform layer — orchestration, governance, security, workflow execution, monitoring, and UI embedding — that sits "above the LLM," Kawasaki said. Users see a dashboard of agents in use and each of their processes, workflows, and executed results. They can “drill down” into an individual record (like a referral or renewal) that shows a step-by-step execution log and related communications to support traceability, debugging, and agent tweaking. The most common adjustments involve logic and incentives, business rules, prompt context, and tool access, Kawasaki said. The biggest issues that come up post-deployment: Exception handling volume can be high: Early spikes in edge cases often occur until guardrails and workflows are tuned. Data quality and completeness: Missing or inconsistent fields and documents can cause escalations; teams can identify which data to prioritize for grounding and which checks to automate. Auditability and trust: Regulated customers, particularly, require clear logs, approvals, role-based access control (RBAC), and audit trails. “We always explain that you have to allocate time to train agents,” Creatio’s CEO Katherine Kostereva told VentureBeat. “It doesn't happen immediately when you switch on the agent, it needs time to understand fully, then the number of mistakes will decrease.” "Data readiness" doesn’t always require an overhaul When looking to deploy agents, “Is my data ready?,” is a common early question. Enterprises know data access is important, but can be turned off by a massive data consolidation project. 
But virtual connections can allow agents access to underlying systems and get around typical data lake/lakehouse/warehouse delays. Kawasaki’s team built a platform that integrates with data, and is now working on an approach that will pull data into a virtual object, process it, and use it like a standard object for UIs and workflows. This way, they don’t have to “persist or duplicate” large volumes of data in their database. This technique can be helpful in areas like banking, where transaction volumes are simply too large to copy into CRM, but are “still valuable for AI analysis and triggers,” Kawasaki said. Once integrations and virtual objects are established, teams can evaluate data completeness, consistency, and availability, and identify low-friction starting points (like document-heavy or unstructured workflows). Kawasaki emphasized the importance of “really using the data in the underlying systems, which tends to actually be the cleanest or the source of truth anyway.” Matching agents to the work The best fit for autonomous (or near-autonomous) agents are high-volume workflows with “clear structure and controllable risk,” Kawasaki said. For instance, document intake and validation in onboarding or loan preparation, or standardized outreach like renewals and referrals. “Especially when you can link them to very specific processes inside an industry — that's where you can really measure and deliver hard ROI,” he said. For instance, financial institutions are often siloed by nature. Commercial lending teams perform in their own environment, wealth management in another. But an autonomous agent can look across departments and separate data stores to identify, for instance, commercial customers who might be good candidates for wealth management or advisory services. “You think it would be an obvious opportunity, but no one is looking across all the silos,” Kawasaki said. 
Some banks that have applied agents to this very scenario have seen “benefits of millions of dollars of incremental revenue,” he claimed, without naming specific institutions. However, in other cases — particularly in regulated industries — longer-context agents are not only preferable, but necessary. For instance, in multi-step tasks like gathering evidence across systems, summarizing, comparing, drafting communications, and producing auditable rationales. “The agent isn't giving you a response immediately,” Kawasaki said. “It may take hours, days, to complete full end-to-end tasks.” This requires orchestrated agentic execution rather than a “single giant prompt,” he said. This approach breaks work down into deterministic steps to be performed by sub-agents. Memory and context management can be maintained across various steps and time intervals. Grounding with RAG can help keep outputs tied to approved sources, and users have the ability to dictate expansion to file shares and other document repositories. This model typically doesn’t require custom retraining or a new foundation model. Whatever model enterprises use (GPT, Claude, Gemini), performance improves through prompts, role definitions, controlled tools, workflows, and data grounding, Kawasaki said. The feedback loop puts “extra emphasis” on intermediate checkpoints, he said. Humans review intermediate artifacts (such as summaries, extracted facts, or draft recommendations) and correct errors. Those can then be converted into better rules and retrieval sources, narrower tool scopes, and improved templates. “What is important for this style of autonomous agent, is you mix the best of both worlds: The dynamic reasoning of AI, with the control and power of true orchestration,” Kawasaki said. Ultimately, agents require coordinated changes across enterprise architecture, new orchestration frameworks, and explicit access controls, Gogia said. 
Agents must be assigned identities to restrict their privileges and keep them within bounds. Observability is critical; monitoring tools can record task completion rates, escalation events, system interactions, and error patterns. This kind of evaluation must be a permanent practice, and agents should be tested to see how they react when encountering new scenarios and unusual inputs. “The moment an AI system can take action, enterprises have to answer several questions that rarely appear during copilot deployments,” Gogia said. Such as: What systems is the agent allowed to access? What types of actions can it perform without approval? Which activities must always require a human decision? How will every action be recorded and reviewed? “Those [enterprises] that underestimate the challenge often find themselves stuck in demonstrations that look impressive but cannot survive real operational complexity,” Gogia said.
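Gogia's four questions map naturally onto a small policy check. The sketch below is hypothetical: `ActionPolicy`, its fields, and the rule that unknown actions default to human review are assumptions for illustration, not a description of any vendor's product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ActionPolicy:
    allowed_systems: frozenset       # what the agent may access
    auto_actions: frozenset          # may run without approval
    human_only_actions: frozenset    # always require a human decision
    audit_log: list = field(default_factory=list)

    def decide(self, agent_id: str, system: str, action: str) -> str:
        """Return 'allow', 'needs_human', or 'deny', and record the decision."""
        if system not in self.allowed_systems:
            outcome = "deny"
        elif action in self.human_only_actions:
            outcome = "needs_human"
        elif action in self.auto_actions:
            outcome = "allow"
        else:
            outcome = "needs_human"  # unknown actions default to review
        # Every decision is logged so it can be reviewed later.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": agent_id,
            "system": system,
            "action": action,
            "outcome": outcome,
        })
        return outcome


# Example with made-up system and action names.
policy = ActionPolicy(
    allowed_systems=frozenset({"crm"}),
    auto_actions=frozenset({"read_record"}),
    human_only_actions=frozenset({"issue_refund"}),
)
```

Because denied and escalated requests are logged alongside allowed ones, the audit trail shows what an agent attempted as well as what it did.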
- How MassMutual and Mass General Brigham turned AI pilot sprawl into production results
Enterprise AI programs rarely fail because of bad ideas. More often, they get stuck in ungoverned pilot mode and never reach production. At a recent VentureBeat event, technology leaders from MassMutual and Mass General Brigham explained how they avoided that trap, and what the results look like when discipline replaces sprawl. At MassMutual, the results are concrete: 30% developer productivity gains, IT help desk resolution times reduced from 11 minutes to one, and customer service calls cut from 15 minutes to just one or two. “We're always starting with why do we care about this problem?” Sears Merritt, MassMutual’s head of enterprise technology and experience, said at the event. “If we solve the problem, how are we gonna know we solved it? And, how much value is associated with doing that?”
Defining metrics, establishing strong feedback loops
MassMutual, a 175-year-old company serving millions of policy owners and customers, has pushed AI into production across the business: customer support, IT, customer acquisition, underwriting, servicing, claims, and other areas. Merritt said his team follows the scientific method, beginning with a hypothesis and testing whether it has an outcome that will tangibly drive the business forward. Some ideas are great, but they may be “intractable in the business” due to factors like lack of data or access, or regulatory constraint. “We won't go any further with an idea until we get crystal clear on how we're going to measure, and how we're going to define success.” Ultimately, it’s up to different departments and leaders to define what quality means: Choose a metric and define the minimum level of quality before a tool is placed into the hands of teams and partners. That starting point creates a quick feedback loop.
“The things that we find slow us down is where there isn't shared clarity on what outcome we're trying to achieve,” which can lead to confusion and constant re-adjusting, said Merritt. “We don’t go to production until there is a business partner that says, ‘Yes, that works.’” His team is strategic about evaluating emerging tools, and “extremely rigorous” when testing and measuring what "good" means. For instance, they perform trust scoring to lower hallucination rates, establish thresholds and evaluation criteria, and monitor for feature and output drift. Merritt also operates with a no-commitment policy — meaning the company doesn’t lock itself into using a particular model. It has what he calls an “incredibly heterogeneous” technology environment combining best of breed models alongside mainframes running on COBOL. That flexibility isn't accidental. His team built common service layers, microservices and APIs that sit between the AI layer and everything underneath — so when a better model comes along, swapping it in doesn't mean starting over. Because, Merritt explained, “the best of breed today might be the worst of breed tomorrow, and we don't want to set ourselves up to fall behind.” Weeding instead of letting a thousand flowers bloom Mass General Brigham (MGB), for its part, took more of a spray and pray approach — at first. Around 15,000 researchers in the not-for-profit health system have been using AI, ML, and deep learning for the last 10 to 15 years, CTO Nallan “Sri” Sriraman said at the same VB event. But last year, he made a bold choice: His team shut down a sprawl of non-governed AI pilots. Initially, “we did follow the thousand flowers bloom [methodology], but we didn't have a thousand flowers, we had probably a few tens of flowers trying to bloom,” he said. Like Merritt’s team at MassMutual, MGB pivoted to a more holistic view, examining why they were developing certain tools for specific departments of workflows. 
They questioned what capabilities they wanted and needed and what investment those required. Sriraman's team also spoke with their primary platform providers — Epic, Workday, ServiceNow, Microsoft — about their roadmaps. This was a “pivotal moment,” he noted, as they realized they were building in-house tools that vendors were already providing (or were planning to roll out). As Sriraman put it: “Why are we building it ourselves? We are already on the platform. It is going to be in the workflow. Leverage it.” That said, the marketplace is still nascent, which can make for difficult decisions. “The analogy I will give is when you ask six blind men to touch an elephant and say, what does this elephant look like?” Sriraman said. “You're gonna get six different answers.” There's nothing wrong with that, he noted; it's just that everybody is discovering and experimenting as the landscape keeps shifting. Instead of a wild West environment, Sriraman’s team distributes Microsoft Copilot to users across the business, and uses a “small landing zone” where they can safely test more sophisticated products and control token use. They also began “consciously embedding AI champions“ across business groups. “This is kind of a reverse of letting a thousand flowers bloom, carefully planting and nourishing,” Sriraman said. Observability is another big consideration; he describes real-time dashboards that manage model drift and safety and allow IT teams to govern AI “a little more pragmatically.” Health monitoring is critical with AI systems, he noted, and his team has established principles and policies around AI use, not to mention least access privileges. In clinical settings, the guardrails are absolute: AI systems never issue the final decision. "There's always going to be a doctor or a physician assistant in the loop to close the decision," Sriraman said. He cited radiology report generation as one area where AI is used heavily, but where a radiologist always signs off. 
Sriraman was clear: "Thou shall not do this: Don't show PHI [protected health information] in Perplexity. As simple as that, right?" And, importantly, there must be safety mechanisms in place. “We need a big red button, kill it,” Sriraman emphasized. “We don’t put anything in the operational setting without that.” Ultimately, while agentic AI is a transformative technology, the enterprise approach to it doesn’t have to be dramatically different. “There is nothing new about this,” Sriraman said. “You can replace the word BPM [business process management] from the '90s and 2000s with AI. The same concepts apply.”
- Designing the agentic AI enterprise for measurable performance
Presented by Edgeverve
Smart, semi‑autonomous AI agents handling complex, real‑time business work is a compelling vision. But moving from impressive pilots to production‑grade impact requires more than clever prompts or proof‑of‑concept demos. It takes clear goals, data‑driven workflows, and an enterprise platform that balances autonomy, governance, observability, and flexibility with hard guardrails from day one.
From pilots to the “operational grey zones”
The next wave of value sits in the connective tissue between applications: those operational grey zones where handoffs, reconciliations, approvals, and data lookups still rely on humans. Assigning agents to these paths means collapsing system boundaries, applying intelligence to context, and re‑imagining processes that were never formally automated. Many pilots stall because they start as lab experiments rather than outcome‑anchored designs tied to production systems, controls, and KPIs. Start with outcomes, not algorithms. Translate organizational KPIs (cash‑flow, DSO, SLA adherence, compliance hit rates, MTTR, NPS, claims leakage, etc.) into agent goals, then cascade them into single‑agent and multi‑agent objectives. Only after goals are explicit should you select workflows and decompose tasks.
Pick targets, then decompose the work
What does “target” actually mean? In agentic programs, a target is a business outcome and the use case that moves it. For example, “reduce unapplied cash by 20%” is the target outcome; “cash application and exceptions handling” is the use case. With the use case in hand, perform persona‑level task decomposition: map the human role (e.g., cash applications analyst, facilities coordinator), enumerate their tasks, and identify which are ripe for agentification (data retrieval, matching, policy checks, decision proposals, transaction initiation).
Delivering on those tasks requires a data‑embedded workflow fabric that can read, write, and reason across enterprise systems while honoring permissions. Data must be AI‑ready, discoverable, governed, labeled where needed, augmented for retrieval (RAG), and policy‑protected for PII, PCI, and regulatory constraints. Integration goes beyond APIs APIs are one mode of integration, not the only one. Robust agent execution typically blends: Stable APIs with lifecycle management for core systems Event‑driven triggers (streams, webhooks, CDC) to react in real time UI/RPA fallbacks where APIs don’t exist Search/RAG connectors for documents and knowledge bases Policy management across tools and actions to enforce entitlements and segregation of duties The north star is integration reliability — built on idempotency, retries, circuit-breakers, and standardized tool schemas — so agents don’t “hallucinate” actions the enterprise can’t verify. A quick example: finance and facilities, in production Inside our organization, we deployed specialized agents in a live CFO environment and in building maintenance. In finance, seven agents interacted with production systems and real accountability structures. Year‑one outcomes included: >3% monthly cash‑flow improvement, 50% productivity gain in affected workflows, 90% faster onboarding, a shift from account‑level handling to function‑level orchestration, and a $32M cash‑flow lift. These results don’t guarantee gains everywhere; they show that designing products can deliver measurable outcomes on a scale. The four design pillars: Autonomy, governance, observability & evals, flexibility 1) Autonomy: right‑size it to the risk Autonomy exists on a spectrum. Early efforts often automate well‑bounded tasks; others pursue research/analysis agents; increasingly, teams target mission‑critical transactional agents (payments, vendor onboarding, pricing changes). 
The rule: match autonomy to risk, and encode the operating mode (suggest‑only, propose‑and‑approve, or execute‑with‑rollback) per task. 2) Governance: guardrails by design, not as bolt‑ons Unbounded agents create unacceptable risk. Build guardrails into the plan: Policy & permissions: tie tools/actions to identity, scopes, and SoD rules. Human‑in‑the‑loop (HITL): where mission‑critical thresholds are crossed (amount, vendor risk, regulatory exposure). Agent lifecycle management: versioning, change control, regression gates, approval workflows, and sunsetting. Third‑party agent orchestration: vet external agents as you would vendors (capabilities, scopes, logs, SLAs). Incident and rollback: kill‑switches, safe‑mode, and compensating transactions. This is how you scale innovation safely while protecting brand, compliance, and customers. 3) Observability & evaluations: trust comes from telemetry Production agents need the same rigor as any core platform: Telemetry: capture full execution traces across perception, planning, tool use, and action, supported by structured logs and replay. Offline evals: scenario tests, red‑teaming, bias and safety checks, cost/performance benchmarks; baseline vs. challenger comparisons. Online evals: shadow mode, A/B, canary releases, guardrail breach alerts, human feedback loops. Explainability & auditability: why was an action taken, which data/tools were used, and who approved. 4) Flexibility: assume volatility, design for swap‑ability Models, tools, and vendors change fast. Treat agentic capability as platform currency: create an environment where teams can evaluate, select, and swap models/tools without tearing down the build. Use a model router, tool registry, and contract‑first interfaces so upgrades are controlled experiments, not rewrites. The agent platform fabric: how platformization turns goals into outcomes A true agentic enterprise requires a platform fabric that transforms goals into outcomes, not a patchwork of isolated pilots. 
This platform anchors enterprise‑to‑agent KPI cascades, drives task decomposition and multi‑agent planning, and provides governed tooling and data access across APIs, RPA, search, and databases. It centralizes knowledge and memory through RAG and vector stores, enforces enterprise controls via a policy engine, and manages performance and safety through a unified model layer. It supports robust orchestration of first‑ and third‑party agents with common context, embeds deep observability and evaluation pipelines, and applies disciplined release engineering from sandbox to GA. Finally, it ensures long‑term resilience through lifecycle management: versioning, deprecation, incident playbooks, and auditable histories. Guardrails in action: a BFSI example Consider payments exception handling in banking — high stakes, regulated, and customer‑visible. An agent proposes a resolution (e.g., auto‑reconcile or escalate) only when: The transaction falls below risk thresholds; above them, it triggers HITL approval. All policy checks (KYC/AML, velocity, sanctions) pass. Observability hooks record rationale, tools invoked, and data used. Rollback/compensation is defined if downstream failures occur. This pattern generalizes to vendor onboarding, pricing overrides, or claims adjudication — mission‑critical work with explicit safety rails. Scale beyond pilots Scaling agentic AI beyond pilots demands disciplined readiness across nine fronts: leaders must clarify which KPIs matter and how agent goals ladder into them, determine which persona tasks are agentified versus remain human‑led, and align each with the right autonomy mode from suggest‑only to propose‑and‑approve to execute‑with‑rollback. They must embed governance guardrails, including HITL points and lifecycle controls; ensure robust observability and evaluation via telemetry, replay, audits, and offline/online tests; and verify data readiness, with governed, policy‑protected, retrieval‑augmented data flows. 
Integration must be reliable, with API lifecycle management, event triggers, and RPA/other fallbacks. The underlying platform should enable model swap‑ability and orchestration of first‑ and third‑party agents without rebuilding. Finally, measurement must focus on true operational impact (cash flow, cycle times, quality, and risk reduction) rather than task counts. The takeaway Agentic AI is not a shortcut; it’s a new system of work. Enterprises that approach it with platform discipline, aligning autonomy with risk, embedding governance and observability, and designing for swap‑ability, will convert pilots into production impact. Those that don’t will keep accumulating impressive but disconnected demos. The difference isn’t how fast you ship an agent; it’s how deliberately you design the enterprise around it. N. Shashidar is SVP & Global Head, Product Management at EdgeVerve. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
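
The payments exception guardrail described in the BFSI example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not EdgeVerve's implementation: the names `PaymentException`, `decide`, `AuditRecord` and the `RISK_THRESHOLD` value are hypothetical, and the `BLOCKED` outcome for a failed policy check is an assumption, since the article does not say what happens in that case.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_RESOLVE = "auto_resolve"
    NEEDS_HUMAN_APPROVAL = "needs_human_approval"
    BLOCKED = "blocked"  # assumption: outcome when a policy check fails


@dataclass
class PaymentException:
    txn_id: str
    amount: float
    policy_checks: dict  # check name -> passed?, e.g. {"kyc_aml": True}


@dataclass
class AuditRecord:
    txn_id: str
    decision: Decision
    rationale: str


RISK_THRESHOLD = 10_000.0  # hypothetical threshold, for illustration only


def decide(exc, audit_log, threshold=RISK_THRESHOLD):
    """Return a Decision and record the rationale for later audit."""
    failed = sorted(name for name, ok in exc.policy_checks.items() if not ok)
    if failed:
        decision = Decision.BLOCKED
        rationale = "failed policy checks: " + ", ".join(failed)
    elif exc.amount >= threshold:
        # At or above the risk threshold: route to a human approver (HITL).
        decision = Decision.NEEDS_HUMAN_APPROVAL
        rationale = f"amount {exc.amount:.2f} at or above threshold {threshold:.2f}"
    else:
        decision = Decision.AUTO_RESOLVE
        rationale = f"amount {exc.amount:.2f} below threshold, all checks passed"
    # Observability hook: every decision leaves an audit record.
    audit_log.append(AuditRecord(exc.txn_id, decision, rationale))
    return decision
```

The sketch covers the threshold, policy-check and audit steps. Rollback or compensation for downstream failures would sit in whatever executes the `AUTO_RESOLVE` path and is left out here.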
- The AI governance mirage: Why 72% of enterprises don’t have the control and security they think they do. Decision makers at 72% of organizations claim to have two or more AI platforms that they identify as their "primary" layer, according to a survey of 40 enterprise companies conducted by VentureBeat last month, revealing real gaps in security and control. For enterprise management and technical leaders, and especially security leaders, these multiple AI platforms extend the attack surfaces of most enterprises at a time when AI-driven attacks have become increasingly potent. The multiple platforms — which include offerings from hyperscalers or AI labs like Microsoft Azure, Google, OpenAI or Anthropic, or big application companies like Epic, Workday or ServiceNow — reflect a state of sprawl that has emerged as these big software providers rush to offer their own AI to their enterprise customers. Those customers, in their own rush to scale AI, are finding they aren’t building a singular strategy — in fact they may be building a collection of contradictions. The strategic paradox: why leading enterprises are building around their vendors For example, take the strategic paradox faced by Mass General Brigham (MGB) hospital system, which has 90,000 employees and is the largest employer in Massachusetts. The hospital system last year had to shut down an uncontrolled number of internal proofs of concept that had sprouted up as employees had gotten carried away with AI projects, said CTO Nallan “Sri” Sriraman at the VentureBeat AI Impact event in Boston on March 26, which focused on the challenges of scaling AI. Instead, the company decided it was better to wait for the software giants it already uses to deliver on their AI roadmaps. Since these companies have so many resources, and were making AI a top priority themselves, it made no sense for MGB to try to build its own AI layer that would be duplicative, he said. "Why are we building it ourselves?" he asked. 
"Leverage it." Yet, even then, Sriraman’s team has been forced to build workarounds, where those companies haven’t done enough. For example, MGB has just completed a “full-scaled” custom build around Microsoft’s Copilot — to get essentially everything offered by that tool — by putting a "skin" around Copilot to handle the safety and data privacy concerns the major model providers haven't yet mastered. Specifically, MGB needed a way for employees to prompt the AI and not have their protected health information (PHI) leaked back to the Copilot LLM provider, OpenAI. The new secure platform, which can support up to 30,000 users, is really the ultimate contradiction: Even though the company has a mandate to leverage the AI provided by the bigger companies, it needs to build around its failures. The contradiction goes even further. These software vendors used by MGB — which also include Epic, Workday and ServiceNow — are all now building agents for their AI, all operating differently. So MGB has to invest in building a “control plane that coordinates and orchestrates all of these agents,” Sriraman said. “That’s where our investment is going to be.” He noted that companies like his are “discovering and experimenting as the landscape keeps shifting." The marketplace is "still nascent," he said, which makes decisions difficult. The "six blind men" problem Sriraman explained the current vendor landscape with an analogy: "When you ask six blind men to touch an elephant and say, what does this elephant look like?" Sriraman said. "You're gonna get six different answers." What emerges from the research VentureBeat conducted in the first quarter, along with conversations like the one in Boston, is a situation that we at VentureBeat are calling a “governance mirage.” While many enterprises say they have adequate governance, in reality they haven’t created clear accountability or specific guardrails, evaluations or security processes to ensure that governance. 
The data of disconnect: confidence vs. systematic oversight The research comes from surveys across January, February and March by VentureBeat of enterprise companies with 100 or more employees, with 40 to 70 qualified respondents per topic area — covering agentic orchestration, AI security, RAG and governance. The data lacks statistical significance in many areas and should be treated as directional. The research on governance found that a majority, or 56%, of respondents said they are “very confident” that they’d detect a misbehaving AI model, suggesting that most decision-makers believe they have sufficient basic governance at their companies. However, nearly a third of respondents have no systematic mechanism to detect AI misbehavior until it surfaces through users or audits. In a world where telemetry leakage accounts for 34% of GenAI incidents (Wiz), and the global average breach cost has hit $4.4M (IBM 2025 Cost of a Data Breach), finding out after the damage is done is the default for too many companies. Moreover, 43% of respondents say a central team owns AI governance. That sounds reassuring — until you look at what’s happening everywhere else. Twenty-three percent say governance is unclear or actively contested between teams. Twenty percent say each platform team governs independently. Six percent say no one has formally addressed it. The rest said they were unsure who owned it. More telling is the barrier data. When asked about the single biggest obstacle to governing AI across platforms, “no single owner or accountable team” ranked second at 29% — just behind vendor opacity. Accountability structure and lack of vendor transparency are the two dominant failure modes, and they compound each other: Without a central owner, no one has the mandate to demand transparency from the vendors. 
The day-two bill: managing sprawl, creep, and lock-in The scaling trap: Red Hat’s warning Brian Gracely, Senior Director at Red Hat, who also spoke at the VentureBeat Boston event last month, addressed the infrastructure side of this sprawl, warning that many enterprises are falling into a trap of deceptive initial wins. Gracely noted that the barrier to entry is almost nonexistent at the start, with nearly anyone able to spin up a project using a credit card and an API key. "Day zero is very, very easy," Gracely said. "Day two is when the bill comes due." Red Hat is positioning its software layer (OpenShift AI) as the necessary buffer to prevent enterprises from getting buried in a single provider's proprietary ecosystem. Gracely’s point is direct: If your control system is built entirely inside one cloud provider’s toolset, you are effectively "renting a cage." The illusion of speed in the early pilot phase often hides a technical debt that becomes obvious the moment you try to move your AI work to a different platform. Gracely illustrated this with a recent example. A senior leader from Red Hat’s centralized CTO office spent part of her vacation contributing to an open-source agent project called OpenClaw, which became widely popular in the first quarter. Within days of her name appearing as a project maintainer, Red Hat was fielding calls from major New York banks. Their problem was immediate: They realized they already had upwards of 10,000 employees bringing "claws" — agent-based tools — into their infrastructure with zero centralized oversight. Breaches caused by employees working on these sorts of unapproved technologies are costly. These so-called “shadow AI” incidents cost on average $670K more than standard incidents, according to IBM. 
Red Hat’s Gracely noted that while organizations can try to shut down these unapproved ports, they eventually have to figure out how to make them productive and secure — a task that requires a serious investment in an orchestration or platform layer. The dynamic defensive: MassMutual’s refusal to bet While some enterprise companies seek an "AI operating system" that oversees all of their AI technologies and apps, others are simply refusing to sign the check. Sears Merritt, CIO and head of enterprise technology at MassMutual, is managing the governance conundrum by intentionally staying in a state of high-velocity flexibility. "Things are so dynamic, it’s hard to know which of the AI vendors will end up on top," Merritt said at the Boston event. For that reason, MassMutual is refusing to enter any long-term contracts with AI vendors. Merritt’s strategy of “dynamic defensive” highlights a core finding of our research: Vendor popularity is changing radically month to month. Anthropic, for example, went from 0% in January to nearly 6% in February, in the number of respondents reporting what agent orchestration technology they were using. Again, the sample size was small, at 70 respondents. Still, even if directional, the dynamic landscape suggests picking a "primary" winner today is a fool’s errand. The January figure likely reflects survey composition: Respondents represent the broader enterprise market, not the developer community where Anthropic has seen its strongest early traction. Until recently, most organizations had signed up early with leaders like Microsoft and OpenAI as their main orchestration providers, due to their early lead with Copilot. Our finding that Anthropic is just now pushing into enterprise agent orchestration may be a confirmation of the recent excitement around that platform. 
One possible explanation is that enterprises already using Claude for model inference are now routing through Anthropic's native tooling rather than third-party frameworks — though the sample is too small to draw firm conclusions. The rise of “platform creep” The leading providers are also shifting toward "managed agents," as reflected by Anthropic’s recent announcement. This offering suggests possible continued platform creep, whereby providers like OpenAI and Anthropic take over more and more of the AI infrastructure — most specifically, in this case, the memory of agentic session details. And there the trap is set. Once your session data and orchestration live inside a provider's proprietary database, you aren't just using a model; you are living in its ecosystem. Moreover, persistent agent memory is a prime target for memory poisoning via injected instructions that influence every future interaction. And when that memory lives in a provider's database, you lose your own forensic capability. The security irony: The fox guarding the hen house We are seeing this platform creep in our data as well. The most jarring finding in our Q1 data is what we call the "Security Irony": the fact that the providers most responsible for creating enterprise AI risk are the same ones enterprises are using to manage it. Respondents said the top selection criterion for AI orchestration platforms was “security and permissions generally” (37.1%), beating out other criteria like cost, flexibility, control and ease of development. Yet, the market is choosing convenience over sovereignty. According to our survey, 26% of enterprises in February were using OpenAI as their primary security solution — the very same provider whose models create the risks they are trying to secure. That trend only seemed to strengthen in March, though, as stated before, we want to be careful. Our sample size is small, and this data should only be taken as directional. 
It’s not clear whether enterprises are choosing OpenAI as a security solution, or just relying on its built-in security features offered by Microsoft Azure (which partnered with OpenAI when it pushed its Copilot solution aggressively in 2024) because customers were already on that platform. Beyond the data, there are anecdotal signs that OpenAI's enterprise position may be shifting. Anthropic's Claude Code drew significant attention among developers early this year alongside the Claude 4.6 model. The subsequent announcement of Mythos, its security-focused model, prompted interest from enterprise security teams given its ability to identify vulnerabilities. OpenAI has also announced a security-focused model, GPT-5.4-Cyber. Our data may also point to a drop in OpenAI’s relative position in a few enterprise AI categories. One area was data-retrieval, where OpenAI again leads among third-party providers, but we saw an increase in the number of respondents instead using in-house solutions for retrieval — perhaps a sign that AI models and agents are getting better at natively being able to use tools to call directly to companies’ existing databases, and that custom code is often a way companies are building this in. However, here again we feel our data is at best directional for now. We are asking the fox to guard the hen house. Hyperscaler security features (like those from OpenAI, Azure, and Google) are winning, because they are already integrated into the platforms enterprises are using. But it creates a single-provider dependency. As agents gain the power to modify documents, call APIs and access databases, the “governance mirage" suggests we have control, while the data shows we are simply clicking "I agree" on whatever the hyperscalers offer. The resulting risks, however, include content injection, privilege escalation and data exfiltration. The path forward: toward a unified control plane The search for the "Dynatrace for AI" So, what is the way out? 
Sriraman argued that the industry desperately needs a "central observability platform" — a "Dynatrace for AI" — that provides full end-to-end visibility, including model drift and safety prompting, agent behavior analytics, privilege escalation alerts, and forensic logging. He is currently working with a number of potential providers to deliver on this. The “swivel chair” warning Sriraman warned that without a unified control plane, enterprises are at risk of sliding back into a fragmented "swivel chair" world — reminiscent of the early, inefficient days of Robotic Process Automation (RPA) — where employees are forced to constantly jump between different siloed AI tools to finish a single workflow. "We don’t want to create a world where you have to switch to do something here and then go back to the platform to do something else," he said. But that desire for a single control plane conflicts with the desire to avoid lock-in. Our data shows the market has settled on the “hybrid control plane.” In other words, the most popular situation among our respondents (at 34.3%) was to use model provider-native solutions like Copilot Studio or OpenAI assistants for some workflows, while also running external options like LangGraph or custom orchestration for others. Smaller numbers of companies reported being more dogmatic here, whether that be deliberately removing the model provider from the orchestration layer entirely, relying only on custom orchestration tools, or relying only on the model provider’s technology. Enterprises trust no single provider enough to give them full control, yet they lack the engineering capacity to build entirely from scratch. The bottom line: The “big red button” Visibility and integration are only half the battle. In a high-stakes industry like healthcare, Sriraman argues that any legitimate control plane must also offer a hard-stop capability. "We need a big red button," he said. "Kill it. 
We should be able to have that … without that, don't put anything in the operational setting." In fact, such a kill switch was formally called for by the security community group OWASP as part of a recommended security framework. The “governance mirage” is the belief that you can scale AI without deciding who owns the control and security plane. If you are one of the 72% of organizations claiming multiple "primary" platforms, be careful because you may not have a strategy; you may have a conflict of interest. It suggests that the winner of the war between the AI behemoths — OpenAI, Anthropic, Google, Microsoft, etc. — won’t necessarily be the one with the best model, but the one that manages to sit above the models and help enterprises enforce a single version of the truth. That may be difficult to achieve, though, given that companies won’t want lock-in with a single player. The data suggests enterprises are already resisting that outcome — and may need to formalize that resistance. Enterprises arguably need to own their control plane with independent security instrumentation, not wait for a vendor to win that role for them.
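
The "big red button" Sriraman describes can be illustrated with a minimal Python sketch. It assumes a control plane that owns a shared stop flag and an agent loop that checks it before every step; `KillSwitch` and `run_agent` are hypothetical names, not part of any vendor's product or of the OWASP framework mentioned above.

```python
import threading


class KillSwitch:
    """A shared hard-stop flag owned by the control plane (hypothetical)."""

    def __init__(self):
        self._tripped = threading.Event()
        self.reason = None

    def trip(self, reason):
        # Record why the stop was requested, then raise the flag.
        self.reason = reason
        self._tripped.set()

    @property
    def active(self):
        return self._tripped.is_set()


def run_agent(steps, kill_switch):
    """Run steps in order, checking the kill switch before each one.

    Returns (results, halted): the results of the steps that ran, and
    whether the run was cut short by the switch.
    """
    results = []
    for step in steps:
        if kill_switch.active:
            return results, True
        results.append(step())
    return results, False
```

Because the flag is a `threading.Event`, an operator or monitoring thread can trip it while the agent is running. A step already in progress is not interrupted; the agent stops before starting the next one.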