How MassMutual and Mass General Brigham turned AI pilot sprawl into production results
Our take

Enterprise AI programs rarely fail because of bad ideas. More often, they get stuck in ungoverned pilot mode and never reach production. At a recent VentureBeat event, technology leaders from MassMutual and Mass General Brigham explained how they avoided that trap — and what the results look like when discipline replaces sprawl.
At MassMutual, the results are concrete: 30% developer productivity gains, IT help desk resolution times reduced from 11 minutes to one, and customer service calls cut from 15 minutes to just one or two.
“We're always starting with why do we care about this problem?” Sears Merritt, MassMutual’s head of enterprise technology and experience, said at the event. “If we solve the problem, how are we gonna know we solved it? And, how much value is associated with doing that?”
Defining metrics, establishing strong feedback loops
MassMutual, a 175-year-old company serving millions of policy owners and customers, has pushed AI into production across the business — customer support, IT, customer acquisition, underwriting, servicing, claims, and other areas.
Merritt said his team follows the scientific method, beginning with a hypothesis and testing whether it has an outcome that will tangibly drive the business forward. Some ideas are great, but they may be “intractable in the business” due to factors like lack of data or access, or regulatory constraint.
“We won't go any further with an idea until we get crystal clear on how we're going to measure, and how we're going to define success.”
Ultimately, it’s up to different departments and leaders to define what quality means: Choose a metric and define the minimum level of quality before a tool is placed into the hands of teams and partners.
That starting point creates a quick feedback loop. “The things that we find slow us down is where there isn't shared clarity on what outcome we're trying to achieve,” which can lead to confusion and constant re-adjusting, said Merritt. “We don’t go to production until there is a business partner that says, ‘Yes, that works.’”
His team is strategic about evaluating emerging tools, and “extremely rigorous” when testing and measuring what "good" means. For instance, they perform trust scoring to lower hallucination rates, establish thresholds and evaluation criteria, and monitor for feature and output drift.
Merritt also operates with a no-commitment policy — meaning the company doesn’t lock itself into using a particular model. It has what he calls an “incredibly heterogeneous” technology environment combining best of breed models alongside mainframes running on COBOL. That flexibility isn't accidental. His team built common service layers, microservices and APIs that sit between the AI layer and everything underneath — so when a better model comes along, swapping it in doesn't mean starting over.
Because, Merritt explained, “the best of breed today might be the worst of breed tomorrow, and we don't want to set ourselves up to fall behind.”
Weeding instead of letting a thousand flowers bloom
Mass General Brigham (MGB), for its part, took more of a spray and pray approach — at first.
Around 15,000 researchers in the not-for-profit health system have been using AI, ML, and deep learning for the last 10 to 15 years, CTO Nallan “Sri” Sriraman said at the same VB event.
But last year, he made a bold choice: His team shut down a sprawl of non-governed AI pilots. Initially, “we did follow the thousand flowers bloom [methodology], but we didn't have a thousand flowers, we had probably a few tens of flowers trying to bloom,” he said.
Like Merritt’s team at MassMutual, MGB pivoted to a more holistic view, examining why they were developing certain tools for specific departments of workflows. They questioned what capabilities they wanted and needed and what investment those required.
Sriraman's team also spoke with their primary platform providers — Epic, Workday, ServiceNow, Microsoft — about their roadmaps. This was a “pivotal moment,” he noted, as they realized they were building in-house tools that vendors were already providing (or were planning to roll out).
As Sriraman put it: “Why are we building it ourselves? We are already on the platform. It is going to be in the workflow. Leverage it.”
That said, the marketplace is still nascent, which can make for difficult decisions. “The analogy I will give is when you ask six blind men to touch an elephant and say, what does this elephant look like?” Sriraman said. “You're gonna get six different answers.”
There's nothing wrong with that, he noted; it's just that everybody is discovering and experimenting as the landscape keeps shifting.
Instead of a wild West environment, Sriraman’s team distributes Microsoft Copilot to users across the business, and uses a “small landing zone” where they can safely test more sophisticated products and control token use.
They also began “consciously embedding AI champions“ across business groups. “This is kind of a reverse of letting a thousand flowers bloom, carefully planting and nourishing,” Sriraman said.
Observability is another big consideration; he describes real-time dashboards that manage model drift and safety and allow IT teams to govern AI “a little more pragmatically.” Health monitoring is critical with AI systems, he noted, and his team has established principles and policies around AI use, not to mention least access privileges.
In clinical settings, the guardrails are absolute: AI systems never issue the final decision. "There's always going to be a doctor or a physician assistant in the loop to close the decision," Sriraman said. He cited radiology report generation as one area where AI is used heavily, but where a radiologist always signs off.
Sriraman was clear: "Thou shall not do this: Don't show PHI [protected health information] in Perplexity. As simple as that, right?"
And, importantly, there must be safety mechanisms in place. “We need a big red button, kill it,” Sriraman emphasized. “We don’t put anything in the operational setting without that.”
Ultimately, while agentic AI is a transformative technology, the enterprise approach to it doesn’t have to be dramatically different. “There is nothing new about this,” Sriraman said. “You can replace the word BPM [business process management] from the '90s and 2000s with AI. The same concepts apply.”
Read on the original site
Open the publisher's page for the full experience
Related Articles
- Are we getting what we paid for? How to turn AI momentum into measurable valueEnterprise AI is entering a new phase — one where the central question is no longer what can be built, but how to make the most of our AI investment. At VentureBeat’s latest AI Impact Tour session, Brian Gracely, director of portfolio strategy at Red Hat, described the operational reality inside large organizations: AI sprawl, rising inference costs, and limited visibility into what those investments are actually returning. It’s the “Day 2” moment — when pilots give way to production, and cost, governance, and sustainability become harder than building the system in the first place. "We've seen customers who say, 'I have 50,000 licenses of Copilot. I don't really know what people are getting out of that. But I do know that I'm paying for the most expensive computing in the world, because it's GPUs,'" Gracely said. "'How am I going to get that under control?'" Why enterprise AI costs are now a board-level problem For much of the past two years, cost was not the primary concern for organizations evaluating generative AI. The experimental phase gave teams cover to spend freely, and the promise of productivity gains justified aggressive investment, but that dynamic is shifting as enterprises enter their second and third budget cycles with AI. The focus has moved from "can we build something?" to "are we getting what we paid for?" Enterprises that made large, early bets on managed AI services are conducting hard reviews of whether those investments are delivering measurable value. The issue isn’t just that GPU computing is expensive. It is that many organizations lack the instrumentation to connect spending to outcomes, making it nearly impossible to justify renewals or scale responsibly. The strategic shift from token consumer to token producer The dominant AI procurement model of the past few years has been straightforward: pay a vendor per token, per seat, or per API call, and let someone else manage the infrastructure. That model made sense as a starting point but is increasingly being questioned by organizations with enough experience to compare alternatives. Enterprises that have been through one AI cycle are starting to rethink that model. "Instead of being purely a token consumer, how can I start being a token generator?" Gracely said. "Are there use cases and workloads that make sense for me to own more? It may mean operating GPUs. It may mean renting GPUs. And then asking, 'Does that workload need the greatest state-of-the-art model? Are there more capable open models or smaller models that fit?'" The decision is not binary. The right answer depends on the workload, the organization, and the risk tolerance involved, but the math is getting more complicated as the number of capable open models, from DeepSeek to models now available through cloud marketplaces, grows. Now enterprises actually have real alternatives to the handful of providers that dominated the landscape two years ago. Falling AI costs and rising usage create a paradox for enterprise budgets Some enterprise leaders argue that locking into infrastructure investments now could mean significantly overpaying in the long run, pointing to the statement from Anthropic CEO Dario Amodei that AI inference costs are declining roughly 60% per year. The emergence of open-source models such as DeepSeek and others has meaningfully expanded the strategic options available to enterprises that are willing to invest in the underlying infrastructure in the last three years. But while costs per token are falling, usage is accelerating at a pace that more than offsets efficiency gains. It's a version of Jevons Paradox, the economic principle that improvements in resource efficiency tend to increase total consumption rather than reduce it, as lower cost enables broader adoption. For enterprise budget planners, this means declining unit costs do not translate into declining total bills. An organization that triples its AI usage while costs fall by half still ends up spending more than it did before. The consideration becomes which workloads genuinely require the most capable and most expensive models, and which can be handled just fine by smaller, cheaper alternatives. The business case for investing in AI infrastructure flexibility The prescription isn't to slow down AI investment, but to build with flexibility being top of mind. The organizations that will win aren't necessarily the ones that move fastest or spend the most; they're the ones building infrastructure and operating models capable of absorbing the next unexpected development. "The more you can build some abstractions and give yourself some flexibility, the more you can experiment without running up costs, but also without jeopardizing your business. Those are as important as asking whether you're doing everything best practice right now," Gracely explained. But despite how entrenched AI discussions have become in enterprise planning cycles, the practical experience most organizations have is still measured in years, not decades. "It feels like we've been doing this forever. We've been doing this for three years," Gracely added. "It's early and it's moving really fast. You don't know what's coming next. But the characteristics of what's coming next — you should have some sense of what that looks like.” For enterprise leaders still calibrating their AI investment strategies, that may be the most actionable takeaway: the goal is not to optimize for today's cost structure, but to build the organizational and technical flexibility to adapt when, not if, it changes again.
- The AI governance mirage: Why 72% of enterprises don’t have the control and security they think they doDecision makers at 72% of organizations claim to have two or more AI platforms that they identify as their "primary" layer, according to a survey of 40 enterprise companies conducted by VentureBeat last month, revealing real gaps in security and control. For enterprise management and technical leaders, and especially security leaders, these multiple AI platforms extend the attack surfaces of most enterprises at a time when AI-driven attacks have become increasingly potent. The multiple platforms — which include offerings from hyperscaler or AI labs like Microsoft Azure, Google, OpenAI or Anthropic, or big application companies like Epic, Workday or ServiceNow — reflect a state of sprawl that has emerged as these big software providers rush to offer their own AI to their enterprise customers. Those customers, in their own rush to scale AI, are finding they aren’t building a singular strategy — in fact they may be building a collection of contradictions. The strategic paradox: why leading enterprises are building around their vendors For example, take the strategic paradox faced by Mass General Brigham (MGB) hospital system, which has 90,000 employees and is the largest employer in Massachusetts. The hospital system last year had to shut down an uncontrolled number of internal proof of concepts that had sprouted up as employees had gotten carried away with AI projects, said CTO Nallan “Sri” Sriraman at the VentureBeat AI Impact event in Boston on March 26, which focused on the challenges of scaling AI. Instead, the company decided it was better to wait for the software giants it already uses to deliver on their AI roadmaps. Since these companies have so many resources, and were making AI a top priority themselves, it made no sense for MGB to try to build its own AI layer that would be duplicative, he said. "Why are we building it ourselves?" he asked. "Leverage it." Yet, even then, Sriraman’s team has been forced to build workarounds, where those companies haven’t done enough. For example, MGB has just completed a “full-scaled” custom build around Microsoft’s Copilot — to get essentially everything offered by that tool — by putting a "skin" around Copilot to handle the safety and data privacy concerns the major model providers haven't yet mastered. Specifically, MGB needed a way for employees to prompt the AI and not have their protected health information (PHI) leaked back to the Copilot LLM provider, OpenAI. The new secure platform, which can support up to 30,000 users, is really the ultimate contradiction: Even though the company has a mandate to leverage the AI provided by the bigger companies, it needs to build around its failures. The contradiction goes even further. These software vendors used by MGB — which also include Epic, Workday and ServiceNow — are all now building agents for their AI, all operating differently. So MGB has to invest in building a “control plane that coordinates and orchestrates all of these agents,” Sriraman said. “That’s where our investment is going to be.” He noted that companies like his are “discovering and experimenting as the landscape keeps shifting." The marketplace is "still nascent," he said, which makes decisions difficult. The "six blind men" problem Sriraman explained the current vendor landscape with an analogy: "When you ask six blind men to touch an elephant and say, what does this elephant look like?" Sriraman said. "You're gonna get six different answers." What emerges from the research VentureBeat conducted in the first quarter, along with conversations like the one in Boston, is a situation that we at VentureBeat are calling a “governance mirage.” While many enterprises say they have adequate governance, in reality they haven’t created clear accountability or specific guardrails, evaluations or security processes to ensure that governance. The data of disconnect: confidence vs. systematic oversight The research comes from surveys across January, February and March by VentureBeat of enterprise companies with 100 or more employees, with 40 to 70 qualified respondents per topic area — covering agentic orchestration, AI security, RAG and governance. The data lacks statistical significance in many areas and should be treated as directional. The research on governance found that a majority, or 56%, of respondents said they are “very confident” that they’d detect a misbehaving AI model, suggesting that most decision-makers believe they have sufficient basic governance at their companies. However, nearly a third of respondents have no systematic mechanism to detect AI misbehavior until it surfaces through users or audits. In a world where telemetry leakage accounts for 34% of GenAI incidents (Wiz), and the global average breach cost has hit $4.4M (IBM 2025 Cost of a Data Breach), finding out after the damage is done is the default for too many companies. Moreover, 43% of respondents say a central team owns AI governance. That sounds reassuring — until you look at what’s happening everywhere else. Twenty-three percent say governance is unclear or actively contested between teams. Twenty percent say each platform team governs independently. Six percent say no one has formally addressed it. The rest said they were unsure who owned it. More telling is the barrier data. When asked about the single biggest obstacle to governing AI across platforms, “no single owner or accountable team” ranked second at 29% — just behind vendor opacity. Accountability structure and lack of vendor transparency are the two dominant failure modes, and they compound each other: Without a central owner, no one has the mandate to demand transparency from the vendors. The day-two bill: managing sprawl, creep, and lock-in The scaling trap: Red Hat’s warning Brian Gracely, Senior Director at Red Hat, who also spoke at the VentureBeat Boston event last month, addressed the infrastructure side of this sprawl, warning that many enterprises are falling into a trap of deceptive initial wins. Gracely noted that the barrier to entry is almost nonexistent at the start, with nearly anyone able to spin up a project using a credit card and an API key. "Day zero is very, very easy," Gracely said. "Day two is when the bill comes due." Red Hat is positioning its software layer (OpenShift AI) as the necessary buffer to prevent enterprises from getting buried in a single provider's proprietary ecosystem. Gracely’s point is direct: If your control system is built entirely inside one cloud provider’s toolset, you are effectively "renting a cage." The illusion of speed in the early pilot phase often hides a technical debt that becomes obvious the moment you try to move your AI work to a different platform. Gracely illustrated this with a recent example. A senior leader from Red Hat’s centralized CTO office spent part of her vacation contributing to an open-source agent project called OpenClaw, which became widely popular in the first quarter. Within days of her name appearing as a project maintainer, Red Hat was fielding calls from major New York banks. Their problem was immediate: They realized they already had upwards of 10,000 employees bringing "claws" — agent-based tools — into their infrastructure with zero centralized oversight. Breaches caused by employees working on these sorts of unapproved technologies are costly. These so-called “shadow AI” incidents cost on average $670K more than standard incidents, according to IBM. Red Hat’s Gracely noted that while organizations can try to shut down these unapproved ports, they eventually have to figure out how to make them productive and secure — a task that requires a serious investment in an orchestration or platform layer. The dynamic defensive: MassMutual’s refusal to bet While some enterprise companies seek an "AI operating system" that oversees all of their AI technologies and apps, others are simply refusing to sign the check. Sears Merritt, CIO and head of enterprise technology at MassMutual, is managing the governance conundrum by intentionally staying in a state of high-velocity flexibility. "Things are so dynamic, it’s hard to know which of the AI vendors will end up on top," Merritt said at the Boston event. For that reason, MassMutual is refusing to enter any long-term contracts with AI vendors. Merritt’s strategy of “dynamic defensive” highlights a core finding of our research: Vendor popularity is changing radically month to month. Anthropic, for example, went from 0% in January to nearly 6% in February, in the number of respondents reporting what agent orchestration technology they were using. Again, the sample size was small, at 70 respondents. Still, even if directional, the dynamic landscape suggests picking a "primary" winner today is a fool’s errand. The January figure likely reflects survey composition: Respondents represent the broader enterprise market, not the developer community where Anthropic has seen its strongest early traction. Until recently, most organizations had signed up early with leaders like Microsoft and OpenAI as their main orchestration providers, due to their early lead with Copilot. Our finding that Anthropic is just now pushing into enterprise agent orchestration may be a confirmation of the recent excitement around that platform. One possible explanation is that enterprises already using Claude for model inference are now routing through Anthropic's native tooling rather than third-party frameworks — though the sample is too small to draw firm conclusions. The rise of “platform creep” The leading providers are also shifting toward "managed agents," as reflected by Anthropic’s recent announcement. This offering suggests possible continued platform creep, whereby providers like OpenAI and Anthropic take over more and more of the AI infrastructure — most specifically, in this case, the memory of agentic session details. And there the trap is set. Once your session data and orchestration live inside a provider's proprietary database, you aren't just using a model; you are living in its ecosystem. Moreover, persistent agent memory is a prime target for memory poisoning via injected instructions that influence every future interaction. And when that memory lives in a provider's database, you lose your own forensic capability. The security irony: The fox guarding the hen house We are seeing this platform creep in our data as well. The most jarring finding in our Q1 data is what we call the "Security Irony": the fact that the providers most responsible for creating enterprise AI risk are the same ones enterprises are using to manage it. Respondents said the top selection criterion for AI orchestration platforms was “security and permissions generally” (37.1%), beating out other criteria like cost, flexibility, control and ease of development. Yet, the market is choosing convenience over sovereignty. According to our survey, 26% of enterprises in February were using OpenAI as their primary security solution — the very same provider whose models create the risks they are trying to secure. That trend only seemed to strengthen in March, though, as stated before, we want to be careful. Our sample size is small, and this data should only be taken as directional. It’s not clear whether enterprises are choosing OpenAI as a security solution, or just relying on its built-in security features offered by Microsoft Azure (which partnered with OpenAI when it pushed its Copilot solution aggressively in 2024) because customers were already on that platform. Beyond the data, there are anecdotal signs that OpenAI's enterprise position may be shifting. Anthropic's Claude Code drew significant attention among developers early this year alongside the Claude 4.6 model. The subsequent announcement of Mythos, its security-focused model, prompted interest from enterprise security teams given its ability to identify vulnerabilities. OpenAI has also announced a security-focused model, GPT-5.4-Cyber. Our data may also point to a drop in OpenAI’s relative position in a few enterprise AI categories. One area was data-retrieval, where OpenAI again leads among third-party providers, but we saw an increase in the number of respondents instead using in-house solutions for retrieval — perhaps a sign that AI models and agents are getting better at natively being able to use tools to call directly to companies’ existing databases, and that custom code is often a way companies are building this in. However, here again we feel our data is at best directional for now. We are asking the fox to guard the hen house. Hyperscaler security features (like those from OpenAI, Azure, and Google) are winning, because they are already integrated into the platforms enterprises are using. But it creates a single-provider dependency. As agents gain the power to modify documents, call APIs and access databases, the “governance mirage" suggests we have control, while the data shows we are simply clicking "I agree" on whatever the hyperscalers offer. The resulting risks, however, include content injection, privilege escalation and data exfiltration. The path forward: toward a unified control plane The search for the "Dynatrace for AI" So, what is the way out? Sriraman argued that the industry desperately needs a "central observability platform" — a "Dynatrace for AI" — that provides full end-to-end visibility, including model drift and safety prompting, agent behavior analytics, privilege escalation alerts, and forensic logging. He is currently working with a number of potential providers to deliver on this. The “swivel chair” warning Sriraman warned that without a unified control plane, enterprises are at risk of sliding back into a fragmented "swivel chair" world — reminiscent of the early, inefficient days of Robotic Process Automation (RPA) — where employees are forced to constantly jump between different siloed AI tools to finish a single workflow. "We don’t want to create a world where you have to switch to do something here and then go back to the platform to do something else," he said. But that desire for a single control plane conflicts with the desire to avoid lock-in. Our data shows the market has settled on the “hybrid control plane.” In other words, the most popular situation among our respondents (at 34.3%), was to use model provider-native solutions like Copilot Studio or OpenAI assistants for some workflows, while also running external options like LangGraph or custom orchestration for others. Smaller numbers of companies reported being more dogmatic here, whether that be deliberately removing the model provider from the orchestration layer entirely, relying only on custom orchestration tools, or relying only on the model provider’s technology Enterprises trust no single provider enough to give them full control, yet they lack the engineering capacity to build entirely from scratch. The bottom line: The “big red button” Visibility and integration are only half the battle. In a high-stakes industry like healthcare, Sriraman argues that any legitimate control plane must also offer a hard-stop capability. "We need a big red button," he said. "Kill it. We should be able to have that … without that, don't put anything in the operational setting." In fact, such a kill switch was formally called for by the security community group OWASP as part of a recommended security framework. The “governance mirage” is the belief that you can scale AI without deciding who owns the control and security plane. If you are one of the 72% of organizations claiming multiple "primary" platforms, be careful because you may not have a strategy; you may have a conflict of interest. It suggests that the winner of the war between the AI behemoths — OpenAI, Anthropic, Google, Microsoft, etc. — won’t necessarily be the one with the best model, but the one that manages to sit above the models and help enterprises enforce a single version of the truth. That may be difficult to achieve, though, given that companies won’t want lock-in with a single player. The data suggests enterprises are already resisting that outcome — and may need to formalize that resistance. Enterprises arguably need to own their control plane with independent security instrumentation, not wait for a vendor to win that role for them.