The three disciplines separating AI agent demos from real-world deployment
Our take

Getting AI agents to perform reliably in production — not just in demos — is turning out to be harder than enterprises anticipated. Fragmented data, unclear workflows, and runaway escalation rates are slowing deployments across industries.
“The technology itself often works well in demonstrations,” said Sanchit Vir Gogia, chief analyst with Greyhound Research. “The challenge begins when it is asked to operate inside the complexity of a real organization.”
Burley Kawasaki, who oversees agent deployment at Creatio, and team have developed a methodology built around three disciplines: data virtualization to work around data lake delays; agent dashboards and KPIs as a management layer; and tightly bounded use-case loops to drive toward high autonomy.
In simpler use cases, Kawasaki says these practices have enabled agents to handle up to 80-90% of tasks on their own. With further tuning, he estimates they could support autonomous resolution in at least half of use cases, even in more complex deployments.
“People have been experimenting a lot with proof of concepts, they've been putting a lot of tests out there,” Kawasaki told VentureBeat. “But now in 2026, we’re starting to focus on mission-critical workflows that drive either operational efficiencies or additional revenue.”
Why agents keep failing in production
Enterprises are eager to adopt agentic AI in some form or another — often because they're afraid to be left out, even before they even identify real-world tangible use cases — but run into significant bottlenecks around data architecture, integration, monitoring, security, and workflow design.
The first obstacle almost always has to do with data, Gogia said. Enterprise information rarely exists in a neat or unified form; it is spread across SaaS platforms, apps, internal databases, and other data stores. Some are structured, some are not.
But even when enterprises overcome the data retrieval problem, integration is a big challenge. Agents rely on APIs and automation hooks to interact with applications, but many enterprise systems were designed long before this kind of autonomous interaction was a reality, Gogia pointed out.
This can result in incomplete or inconsistent APIs, and systems can respond unpredictably when accessed programmatically. Organizations also run into snags when they attempt to automate processes that were never formally defined, Gogia said.
“Many business workflows depend on tacit knowledge,” he said. That is, employees know how to resolve exceptions they’ve seen before without explicit instructions — but, those missing rules and instructions become startlingly obvious when workflows are translated into automation logic.
The tuning loop
Creatio deploys agents in a “bounded scope with clear guardrails,” followed by an “explicit” tuning and validation phase, Kawasaki explained. Teams review initial outcomes, adjust as needed, then re-test until they’ve reached an acceptable level of accuracy.
That loop typically follows this pattern:
Design-time tuning (before go-live): Performance is improved through prompt engineering, context wrapping, role definitions, workflow design, and grounding in data and documents.
Human-in-the-loop correction (during execution): Devs approve, edit, or resolve exceptions. In instances where humans have to intervene the most (escalation or approval), users establish stronger rules, provide more context, and update workflow steps; or, they’ll narrow tool access.
Ongoing optimization (after go-live): Devs continue to monitor exception rates and outcomes, then tune repeatedly as needed, helping to improve accuracy and autonomy over time.
Kawasaki’s team applies retrieval-augmented generation to ground agents in enterprise knowledge bases, CRM data, and other proprietary sources.
Once agents are deployed in the wild, they are monitored with a dashboard providing performance analytics, conversion insights, and auditability. Essentially, agents are treated like digital workers. They have their own management layer with dashboards and KPIs.
For instance, an onboarding agent will be incorporated as a standard dashboard interface providing agent monitoring and telemetry. This is part of the platform layer — orchestration, governance, security, workflow execution, monitoring, and UI embedding — that sits "above the LLM," Kawasaki said.
Users see a dashboard of agents in use and each of their processes, workflows, and executed results. They can “drill down” into an individual record (like a referral or renewal) that shows a step-by-step execution log and related communications to support traceability, debugging, and agent tweaking. The most common adjustments involve logic and incentives, business rules, prompt context, and tool access, Kawasaki said.
The biggest issues that come up post-deployment:
Exception handling volume can be high: Early spikes in edge cases often occur until guardrails and workflows are tuned.
Data quality and completeness: Missing or inconsistent fields and documents can cause escalations; teams can identify which data to prioritize for grounding and which checks to automate.
Auditability and trust: Regulated customers, particularly, require clear logs, approvals, role-based access control (RBAC), and audit trails.
“We always explain that you have to allocate time to train agents,” Creatio’s CEO Katherine Kostereva told VentureBeat. “It doesn't happen immediately when you switch on the agent, it needs time to understand fully, then the number of mistakes will decrease.”
"Data readiness" doesn’t always require an overhaul
When looking to deploy agents, “Is my data ready?,” is a common early question. Enterprises know data access is important, but can be turned off by a massive data consolidation project.
But virtual connections can allow agents access to underlying systems and get around typical data lake/lakehouse/warehouse delays. Kawasaki’s team built a platform that integrates with data, and is now working on an approach that will pull data into a virtual object, process it, and use it like a standard object for UIs and workflows. This way, they don’t have to “persist or duplicate” large volumes of data in their database.
This technique can be helpful in areas like banking, where transaction volumes are simply too large to copy into CRM, but are “still valuable for AI analysis and triggers,” Kawasaki said.
Once integrations and virtual objects are established, teams can evaluate data completeness, consistency, and availability, and identify low-friction starting points (like document-heavy or unstructured workflows).
Kawasaki emphasized the importance of “really using the data in the underlying systems, which tends to actually be the cleanest or the source of truth anyway.”
Matching agents to the work
The best fit for autonomous (or near-autonomous) agents are high-volume workflows with “clear structure and controllable risk,” Kawasaki said. For instance, document intake and validation in onboarding or loan preparation, or standardized outreach like renewals and referrals.
“Especially when you can link them to very specific processes inside an industry — that's where you can really measure and deliver hard ROI,” he said.
For instance, financial institutions are often siloed by nature. Commercial lending teams perform in their own environment, wealth management in another. But an autonomous agent can look across departments and separate data stores to identify, for instance, commercial customers who might be good candidates for wealth management or advisory services.
“You think it would be an obvious opportunity, but no one is looking across all the silos,” Kawasaki said. Some banks that have applied agents to this very scenario have seen “benefits of millions of dollars of incremental revenue,” he claimed, without naming specific institutions.
However, in other cases — particularly in regulated industries — longer-context agents are not only preferable, but necessary. For instance, in multi-step tasks like gathering evidence across systems, summarizing, comparing, drafting communications, and producing auditable rationales.
“The agent isn't giving you a response immediately,” Kawasaki said. “It may take hours, days, to complete full end-to-end tasks.”
This requires orchestrated agentic execution rather than a “single giant prompt,” he said. This approach breaks work down into deterministic steps to be performed by sub-agents. Memory and context management can be maintained across various steps and time intervals. Grounding with RAG can help keep outputs tied to approved sources, and users have the ability to dictate expansion to file shares and other document repositories.
This model typically doesn’t require custom retraining or a new foundation model. Whatever model enterprises use (GPT, Claude, Gemini), performance improves through prompts, role definitions, controlled tools, workflows, and data grounding, Kawasaki said.
The feedback loop puts “extra emphasis” on intermediate checkpoints, he said. Humans review intermediate artifacts (such as summaries, extracted facts, or draft recommendations) and correct errors. Those can then be converted into better rules and retrieval sources, narrower tool scopes, and improved templates.
“What is important for this style of autonomous agent, is you mix the best of both worlds: The dynamic reasoning of AI, with the control and power of true orchestration,” Kawasaki said.
Ultimately, agents require coordinated changes across enterprise architecture, new orchestration frameworks, and explicit access controls, Gogia said. Agents must be assigned identities to restrict their privileges and keep them within bounds. Observability is critical; monitoring tools can record task completion rates, escalation events, system interactions, and error patterns. This kind of evaluation must be a permanent practice, and agents should be tested to see how they react when encountering new scenarios and unusual inputs.
“The moment an AI system can take action, enterprises have to answer several questions that rarely appear during copilot deployments,” Gogia said. Such as: What systems is the agent allowed to access? What types of actions can it perform without approval? Which activities must always require a human decision? How will every action be recorded and reviewed?
“Those [enterprises] that underestimate the challenge often find themselves stuck in demonstrations that look impressive but cannot survive real operational complexity,” Gogia said.
Read on the original site
Open the publisher's page for the full experience
Related Articles
- Designing the agentic AI enterprise for measurable performancePresented by Edgeverve Smart, semi‑autonomous AI agents handling complex, real‑time business work is a compelling vision. But moving from impressive pilots to production‑grade impact requires more than clever prompts or proof‑of‑concept demos. It takes clear goals, data‑driven workflows, and an enterprise platform that balances autonomy, governance, observability, and flexibility with hard guardrails from day one. From pilots to the “operational grey zones” The next wave of value sits in the connective tissue between applications — those operational grey zones where handoffs, reconciliations, approvals, and data lookups still rely on humans. Assigning agents to these paths means collapsing system boundaries, applying intelligence to context, and re‑imagining processes that were never formally automated. Many pilots stall because they start as lab experiments rather than outcome‑anchored designs tied to production systems, controls, and KPIs. Start with outcomes, not algorithms. Translate organizational KPIs (cash‑flow, DSO, SLA adherence, compliance hit rates, MTTR, NPS, claims leakage, etc.) into agent goals, then cascade them into single‑agent and multi‑agent objectives. Only after goals are explicit should you select workflows and decompose tasks. Pick targets, then decompose the work What does “target” actually mean? In agentic programs, a target is a business outcome and the use case that moves it. For example, “reduce unapplied cash by 20%” target outcome; “cash application and exceptions handling” use case. With the use case in hand, perform persona‑level task decomposition: map the human role (e.g., cash applications analyst, facilities coordinator), enumerate their tasks, and identify which are ripe for agentification (data retrieval, matching, policy checks, decision proposals, transaction initiation). Delivering on those tasks requires a data‑embedded workflow fabric that can read, write, and reason across enterprise systems while honoring permissions. Data must be AI‑ready, discoverable, governed, labeled where needed, augmented for retrieval (RAG), and policy‑protected for PII, PCI, and regulatory constraints. Integration goes beyond APIs APIs are one mode of integration, not the only one. Robust agent execution typically blends: Stable APIs with lifecycle management for core systems Event‑driven triggers (streams, webhooks, CDC) to react in real time UI/RPA fallbacks where APIs don’t exist Search/RAG connectors for documents and knowledge bases Policy management across tools and actions to enforce entitlements and segregation of duties The north star is integration reliability — built on idempotency, retries, circuit-breakers, and standardized tool schemas — so agents don’t “hallucinate” actions the enterprise can’t verify. A quick example: finance and facilities, in production Inside our organization, we deployed specialized agents in a live CFO environment and in building maintenance. In finance, seven agents interacted with production systems and real accountability structures. Year‑one outcomes included: >3% monthly cash‑flow improvement, 50% productivity gain in affected workflows, 90% faster onboarding, a shift from account‑level handling to function‑level orchestration, and a $32M cash‑flow lift. These results don’t guarantee gains everywhere; they show that designing products can deliver measurable outcomes on a scale. The four design pillars: Autonomy, governance, observability & evals, flexibility 1) Autonomy: right‑size it to the risk Autonomy exists on a spectrum. Early efforts often automate well‑bounded tasks; others pursue research/analysis agents; increasingly, teams target mission‑critical transactional agents (payments, vendor onboarding, pricing changes). The rule: match autonomy to risk, and encode the operating mode suggest‑only, propose‑and‑approve, or execute‑with‑rollback per task. 2) Governance: guardrails by design, not as bolt‑ons Unbounded agents create unacceptable risk. Build guardrails into the plan: Policy & permissions: tie tools/actions to identity, scopes, and SoD rules. Human‑in‑the‑loop (HITL): where mission‑critical thresholds are crossed (amount, vendor risk, regulatory exposure). Agent lifecycle management: versioning, change control, regression gates, approval workflows, and sunsetting. Third‑party agent orchestration: vet external agents like vendors, capabilities, scopes, logs, SLAs. Incident and rollback: kill‑switches, safe‑mode, and compensating transactions. This is how you scale innovation safely while protecting brand, compliance, and customers. 3) Observability & evaluations: trust comes from telemetry Production agents need the same rigor as any core platform: Telemetry: capture full execution traces across perception, planning, tool use, action supported by structured logs and replay. Offline evals: cenario tests, red‑teaming, bias and safety checks, cost/performance benchmarks; baseline vs. challenger comparisons. Online evals: shadow mode, A/B, canary releases, guardrail breach alerts, human feedback loops. Explainability & auditability: why was an action taken, which data/tools were used, and who approved. 4) Flexibility: assume volatility, design for swap‑ability Models, tools, and vendors change fast. Treat agentic capability as platform currency: create an environment where teams can evaluate, select, and swap models/tools without tearing down the build. Use a model router, tool registry, and contract‑first interfaces so upgrades are controlled experiments, not rewrites. The agent platform fabric: how platformization turns goals into outcomes A true agentic enterprise requires a platform fabric that transforms goals into outcomes, not a patchwork of isolated pilots. This platform anchors enterprise‑to‑agent KPI cascades, drives task decomposition and multi‑agent planning, and provides governed tooling and data access across APIs, RPA, search, and databases. It centralizes knowledge and memory through RAG and vector stores, enforces enterprise controls via a policy engine, and manages performance and safety through a unified model layer. It supports robust orchestration of first‑ and third‑party agents with common context, embeds deep observability and evaluation pipelines, and applies disciplined release engineering from sandbox to GA. Finally, it ensures long‑term resilience through lifecycle management versioning, deprecation, incident playbooks, and auditable histories. Guardrails in action: a BFSI example Consider payments exception handling in banking — high stakes, regulated, and customer‑visible. An agent proposes a resolution (e.g., auto‑reconcile or escalate) only when: The transaction falls below risk thresholds; above them, it triggers HITL approval. All policy checks (KYC/AML, velocity, sanctions) pass. Observability hooks record rationale, tools invoked, and data used. Rollback/compensation is defined if downstream failures occur. This pattern generalizes to vendor onboarding, pricing overrides, or claims adjudication — mission‑critical work with explicit safety rails. Scale beyond pilots Scaling agentic AI beyond pilots demands disciplined readiness across nine fronts: leaders must clarify which KPIs matter and how agent goals ladder into them, determine which persona tasks are agentified versus remain human‑led, and align each with the right autonomy mode from suggest‑only to propose‑and‑approve to execute‑with‑rollback. They must embed governance guardrails, including HITL points and lifecycle controls; ensure robust observability and evaluation via telemetry, replay, audits, and offline/online tests; and verify data readiness, with governed, policy‑protected, retrieval‑augmented data flows. Integration must be reliable, with API lifecycle management, event triggers, and RPA/other fallbacks. The underlying platform should enable model swap‑ability and orchestration of first‑ and third‑party agents without rebuilding. Finally, measurement must focus on true operational impact cash flow, cycle times, quality, and risk reduction rather than task counts. The takeaway Agentic AI is not a shortcut; it’s a new system of work. Enterprises that approach it with platform discipline aligning autonomy with risk, embedding governance and observability, and designing for swap‑ability will convert pilots into production impact. Those that don’t keep accumulating impressive but disconnected demos. The difference isn’t how fast you ship an agent; it’s how deliberately you design the enterprise around it. N. Shashidar is SVP & Global Head, Product Management at EdgeVerve. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
- Salesforce launches Agentforce Operations to fix the workflows breaking enterprise AIEnterprise AI teams are hitting a wall — not because their models can't reason, but because the workflows underneath them were never built for agents. Tasks fail, handoffs break, and the problem compounds as organizations push agents deeper into back-office systems. A new architectural layer is emerging to address it: workflow execution control planes that impose deterministic structure on processes agents are expected to run. One of the companies bringing this to the forefront is Salesforce, with a new workflow platform that turns back-office workflows into a set of tasks for specialized agents to complete. Users can upload their processes or use one of the set Blueprints provided by Salesforce, and Agentforce Operations will break it down for agents. Salesforce senior vice president of Product, Sanjna Parulekar, told VentureBeat in an interview that the problem is that many enterprise workflows are not built for agents. “What we’ve observed with customers is that a lot of times, the brokenness in a process is probably in your product requirements document,” Parulekar said. “So when that’s uploaded into a product, it doesn’t quite work. We can optimize it and cut out some things and replace it with an agent.” Without this control panel layer, enterprises could risk deploying agents that increase cost rather than fix their workflow problems. Making the workflow work for agents, not just humans Enterprises deploying agents are learning a costly lesson: Their workflows were designed around human judgment gaps, not machine execution. Processes that evolved through years of workarounds — loosely defined steps, implicit decisions, coordination that depends on individuals knowing what to do next — break when agents are asked to follow them literally. Even with all of an enterprise’s context at its fingertips, AI systems will have difficulty completing tasks if it is not clear what it’s supposed to do. Parulekar said her team found that focusing on what makes the process tick and breaking it down into more explicit steps and workflows makes the system more deterministic. Then, when platforms like Agentforce Operations introduce agents, those agents already know their specific tasks. “It forces companies to rethink their processes and introduces observability into the mix because of the session tracing model in the system,” she said. Parulekar said human checks can be built into the system, so the process is more transparent. What makes this approach different from other workflow automation offerings is that it doesn’t rely on agents to decide what to do next; the system does. Unlike more traditional automation tools that route tasks and agents on probabilistic decision-making, this enforces execution on a more pre-defined, deterministic structure. The problem it introduces Codifying a workflow doesn't fix a broken one. If a process has flawed steps, encoding it for agents locks in the problem at scale. And once workflows are distributed across agents, the challenge shifts from execution to governance: who owns the process, who validates it, and how it evolves when business conditions change. It puts the onus on teams to take a hard look at what works for them and what doesn’t. Organizations need to consider that, along with the execution control plane offered by platforms like Agentforce Operations, someone should be made responsible for task completion and success. Brandon Metcalf, founder and CEO of workforce orchestration company Asymbl, told VentureBeat in a separate interview that the key to both humans and agents following a workflow is a shared goal. “You have to understand the goal or the agent or human won’t complete the task successfully,” Metcalf said. “Someone has to manage that outcome that has to be delivered. It can be a person or an agent.” The bottleneck has moved. As Metcalf framed it, the question is no longer whether agents can reason through a task, it's whether the workflow underneath them is coherent enough to execute. For enterprises that built their processes around human judgment and institutional memory, that's a harder fix than swapping in a smarter model.
- Salesforce launches Headless 360 to turn its entire platform into infrastructure for AI agentsSalesforce on Wednesday unveiled the most ambitious architectural transformation in its 27-year history, introducing "Headless 360" — a sweeping initiative that exposes every capability in its platform as an API, MCP tool, or CLI command so AI agents can operate the entire system without ever opening a browser. The announcement, made at the company's annual TDX developer conference in San Francisco, ships more than 100 new tools and skills immediately available to developers. It marks a decisive response to the existential question hanging over enterprise software: In a world where AI agents can reason, plan, and execute, does a company still need a CRM with a graphical interface? Salesforce's answer: No — and that's exactly the point. "We made a decision two and a half years ago: Rebuild Salesforce for agents," the company said in its announcement. "Instead of burying capabilities behind a UI, expose them so the entire platform will be programmable and accessible from anywhere." The timing is anything but coincidental. Salesforce finds itself navigating one of the most turbulent periods in enterprise software history — a sector-wide sell-off that has pushed the iShares Expanded Tech-Software Sector ETF down roughly 28% from its September peak. The fear driving the decline: that AI, particularly large language models from Anthropic, OpenAI, and others, could render traditional SaaS business models obsolete. Jayesh Govindarjan, EVP of Salesforce and one of the key architects behind the Headless 360 initiative, described the announcement as rooted not in marketing theory but in hard-won lessons from deploying agents with thousands of enterprise customers. "The problem that emerged is the lifecycle of building an agentic system for every one of our customers on any stack, whether it's ours or somebody else's," Govindarjan told VentureBeat in an exclusive interview. "The challenge that they face is very much the software development challenge. How do I build an agent? That's only step one." More than 100 new tools give coding agents full access to the Salesforce platform for the first time Salesforce Headless 360 rests on three pillars that collectively represent the company's attempt to redefine what an enterprise platform looks like in the agentic era. The first pillar — build any way you want — delivers more than 60 new MCP (Model Context Protocol) tools and 30-plus preconfigured coding skills that give external coding agents like Claude Code, Cursor, Codex, and Windsurf complete, live access to a customer's entire Salesforce org, including data, workflows, and business logic. Developers no longer need to work inside Salesforce's own IDE. They can direct AI coding agents from any terminal to build, deploy, and manage Salesforce applications. Agentforce Vibes 2.0, the company's own native development environment, now includes what it calls an "open agent harness" supporting both the Anthropic agent SDK and the OpenAI agents SDK. As demonstrated during the keynote, developers can choose between Claude Code and OpenAI agents depending on the task, with the harness dynamically adjusting available capabilities based on the selected agent. The environment also adds multi-model support, including Claude Sonnet and GPT-5, along with full org awareness from the start. A significant technical addition is native React support on the Salesforce platform. During the keynote demo, presenters built a fully functional partner service application using React — not Salesforce's own Lightning framework — that connected to org metadata via GraphQL while inheriting all platform security primitives. This opens up dramatically more expressive front-end possibilities for developers who want complete control over the visual layer. The second pillar — deploy on any surface — centers on the new Agentforce Experience Layer, which separates what an agent does from how it appears, rendering rich interactive components natively across Slack, mobile apps, Microsoft Teams, ChatGPT, Claude, Gemini, and any client supporting MCP apps. During the keynote, presenters defined an experience once and deployed it across six different surfaces without writing surface-specific code. The philosophical shift is significant: rather than pulling customers into a Salesforce UI, enterprises push branded, interactive agent experiences into whatever workspace their customers already inhabit. The third pillar — build agents you can trust at scale — introduces an entirely new suite of lifecycle management tools spanning testing, evaluation, experimentation, observation, and orchestration. Agent Script, the company's new domain-specific language for defining agent behavior deterministically, is now generally available and open-sourced. A new Testing Center surfaces logic gaps and policy violations before deployment. Custom Scoring Evals let enterprises define what "good" looks like for their specific use case. And a new A/B Testing API enables running multiple agent versions against real traffic simultaneously. Why enterprise customers kept breaking their own AI agents — and how Salesforce redesigned its tooling in response Perhaps the most technically significant — and candid — portion of VentureBeat's interview with Govindarjan addressed the fundamental engineering tension at the heart of enterprise AI: agents are probabilistic systems, but enterprises demand deterministic outcomes. Govindarjan explained that early Agentforce customers, after getting agents into production through "sheer hard work," discovered a painful reality. "They were afraid to make changes to these agents, because the whole system was brittle," he said. "You make one change and you don't know whether it's going to work 100% of the time. All the testing you did needs to be redone." This brittleness problem drove the creation of Agent Script, which Govindarjan described as a programming language that "brings together the determinism that's in programming languages with the inherent flexibility in probabilistic systems that LLMs provide." The language functions as a single flat file — versionable, auditable — that defines a state machine governing how an agent behaves. Within that machine, enterprises specify which steps must follow explicit business logic and which can reason freely using LLM capabilities. Salesforce open-sourced Agent Script this week, and Govindarjan noted that Claude Code can already generate it natively because of its clean documentation. The approach stands in sharp contrast to the "vibe coding" movement gaining traction elsewhere in the industry. As the Wall Street Journal recently reported, some companies are now attempting to vibe-code entire CRM replacements — a trend Salesforce's Headless 360 directly addresses by making its own platform the most agent-friendly substrate available. Govindarjan described the tooling as a product of Salesforce's own internal practice. "We needed these tools to make our customers successful. Then our FDEs needed them. We hardened them, and then we gave them to our customers," he told VentureBeat. In other words, Salesforce productized its own pain. Inside the two competing AI agent architectures Salesforce says every enterprise will need Govindarjan drew a revealing distinction between two fundamentally different agentic architectures emerging in the enterprise — one for customer-facing interactions and one he linked to what he called the "Ralph Wiggum loop." Customer-facing agents — those deployed to interact with end customers for sales or service — demand tight deterministic control. "Before customers are willing to put these agents in front of their customers, they want to make sure that it follows a certain paradigm — a certain brand set of rules," Govindarjan told VentureBeat. Agent Script encodes these as a static graph — a defined funnel of steps with LLM reasoning embedded within each step. The "Ralph Wiggum loop," by contrast, represents the opposite end of the spectrum: a dynamic graph that unrolls at runtime, where the agent autonomously decides its next step based on what it learned in the previous step, killing dead-end paths and spawning new ones until the task is complete. This architecture, Govindarjan said, manifests primarily in employee-facing scenarios — developers using coding agents, salespeople running deep research loops, marketers generating campaign materials — where an expert human reviews the output before it ships. "Ralph Wiggum loops are great for employee-facing because employees are, in essence, experts at something," Govindarjan explained. "Developers are experts at development, salespeople are experts at sales." The critical technical insight: both architectures run on the same underlying platform and the same graph engine. "This is a dynamic graph. This is a static graph," he said. "It's all a graph underneath." That unified runtime — spanning the spectrum from tightly controlled customer interactions to free-form autonomous loops — may be Salesforce's most important technical bet, sparing enterprises from maintaining separate platforms for different agent modalities. Salesforce hedges its bets on MCP while opening its ecosystem to every major AI model and tool Salesforce's embrace of openness at TDX was striking. The platform now integrates with OpenAI, Anthropic, Google Gemini, Meta's LLaMA, and Mistral AI models. The open agent harness supports third-party agent SDKs. MCP tools work from any coding environment. And the new AgentExchange marketplace unifies 10,000 Salesforce apps, 2,600-plus Slack apps, and 1,000-plus Agentforce agents, tools, and MCP servers from partners including Google, Docusign, and Notion, backed by a new $50 million AgentExchange Builders Initiative. Yet Govindarjan offered a surprisingly candid assessment of MCP itself — the protocol Anthropic created that has become a de facto standard for agent-tool communication. "To be very honest, not at all sure" that MCP will remain the standard, he told VentureBeat. "When MCP first came along as a protocol, a lot of us engineers felt that it was a wrapper on top of a really well-written CLI — which now it is. A lot of people are saying that maybe CLI is just as good, if not better." His approach: pragmatic flexibility. "We're not wedded to one or the other. We just use the best, and often we will offer all three. We offer an API, we offer a CLI, we offer an MCP." This hedging explains the "Headless 360" naming itself — rather than betting on a single protocol, Salesforce exposes every capability across all three access patterns, insulating itself against protocol shifts. Engine, the B2B travel management company featured prominently in the keynote demos, offered a real-world proof point for the open ecosystem approach. The company built its customer service agent, Ava, in 12 days using Agentforce and now handles 50% of customer cases autonomously. Engine runs five agents across customer-facing and employee-facing functions, with Data 360 at the heart of its infrastructure and Slack as its primary workspace. "CSAT goes up, costs to deliver go down. Customers are happier. We're getting them answers faster. What's the trade off? There's no trade off," an Engine executive said during the keynote. Underpinning all of it is a shift in how Salesforce gets paid. The company is moving from per-seat licensing to consumption-based pricing for Agentforce — a transition Govindarjan described as "a business model change and innovation for us." It's a tacit acknowledgment that when agents, not humans, are doing the work, charging per user no longer makes sense. Salesforce isn't defending the old model — it's dismantling it and betting the company on what comes next Govindarjan framed the company's evolution in architectural terms. Salesforce has organized its platform around four layers: a system of context (Data 360), a system of work (Customer 360 apps), a system of agency (Agentforce), and a system of engagement (Slack and other surfaces). Headless 360 opens every layer via programmable endpoints. "What you saw today, what we're doing now, is we're opening up every single layer, right, with MCP tools, so we can go build the agentic experiences that are needed," Govindarjan told VentureBeat. "I think you're seeing a company transforming itself." Whether that transformation succeeds will depend on execution across thousands of customer deployments, the staying power of MCP and related protocols, and the fundamental question of whether incumbent enterprise platforms can move fast enough to remain relevant when AI agents can increasingly build new systems from scratch. The software sector's bear market, the financial pressures bearing down on the entire industry, and the breathtaking pace of LLM improvement all conspire to make this one of the highest-stakes bets in enterprise technology. But there is an irony embedded in Salesforce's predicament that Headless 360 makes explicit. The very AI capabilities that threaten to displace traditional software are the same capabilities that Salesforce now harnesses to rebuild itself. Every coding agent that could theoretically replace a CRM is now, through Headless 360, a coding agent that builds on top of one. The company is not arguing that agents won't change the game. It's arguing that decades of accumulated enterprise data, workflows, trust layers, and institutional logic give it something no coding agent can generate from a blank prompt. As Benioff declared on CNBC's Mad Money in March: "The software industry is still alive, well and growing." Headless 360 is his company's most forceful attempt to prove him right — by tearing down the walls of the very platform that made Salesforce famous and inviting every agent in the world to walk through the front door. Parker Harris, Salesforce's co-founder, captured the bet most succinctly in a question he posed last month: "Why should you ever log into Salesforce again?" If Headless 360 works as designed, the answer is: You shouldn't have to. And that, Salesforce is wagering, is precisely what will keep you paying for it.
- Testing autonomous agents (Or: how I learned to stop worrying and embrace chaos)Look, we've spent the last 18 months building production AI systems, and we'll tell you what keeps us up at night — and it's not whether the model can answer questions. That's table stakes now. What haunts us is the mental image of an agent autonomously approving a six-figure vendor contract at 2 a.m. because someone typo'd a config file. We've moved past the era of "ChatGPT wrappers" (thank God), but the industry still treats autonomous agents like they're just chatbots with API access. They're not. When you give an AI system the ability to take actions without human confirmation, you're crossing a fundamental threshold. You're not building a helpful assistant anymore — you're building something closer to an employee. And that changes everything about how we need to engineer these systems. The autonomy problem nobody talks about Here's what's wild: We've gotten really good at making models that *sound* confident. But confidence and reliability aren't the same thing, and the gap between them is where production systems go to die. We learned this the hard way during a pilot program where we let an AI agent manage calendar scheduling across executive teams. Seems simple, right? The agent could check availability, send invites, handle conflicts. Except, one Monday morning, it rescheduled a board meeting because it interpreted "let's push this if we need to" in a Slack message as an actual directive. The model wasn't wrong in its interpretation — it was plausible. But plausible isn't good enough when you're dealing with autonomy. That incident taught us something crucial: The challenge isn't building agents that work most of the time. It's building agents that fail gracefully, know their limitations, and have the circuit breakers to prevent catastrophic mistakes. What reliability actually means for autonomous systems Layered reliability architecture When we talk about reliability in traditional software engineering, we've got decades of patterns: Redundancy, retries, idempotency, graceful degradation. But AI agents break a lot of our assumptions. Traditional software fails in predictable ways. You can write unit tests. You can trace execution paths. With AI agents, you're dealing with probabilistic systems making judgment calls. A bug isn't just a logic error—it's the model hallucinating a plausible-sounding but completely fabricated API endpoint, or misinterpreting context in a way that technically parses but completely misses the human intent. So what does reliability look like here? In our experience, it's a layered approach. Layer 1: Model selection and prompt engineering This is foundational but insufficient. Yes, use the best model you can afford. Yes, craft your prompts carefully with examples and constraints. But don't fool yourself into thinking that a great prompt is enough. I've seen too many teams ship "GPT-4 with a really good system prompt" and call it enterprise-ready. Layer 2: Deterministic guardrails Before the model does anything irreversible, run it through hard checks. Is it trying to access a resource it shouldn't? Is the action within acceptable parameters? We're talking old-school validation logic — regex, schema validation, allowlists. It's not sexy, but it's effective. One pattern that's worked well for us: Maintain a formal action schema. Every action an agent can take has a defined structure, required fields, and validation rules. The agent proposes actions in this schema, and we validate before execution. If validation fails, we don't just block it — we feed the validation errors back to the agent and let it try again with context about what went wrong. Layer 3: Confidence and uncertainty quantification Here's where it gets interesting. We need agents that know what they don't know. We've been experimenting with agents that can explicitly reason about their confidence before taking actions. Not just a probability score, but actual articulated uncertainty: "I'm interpreting this email as a request to delay the project, but the phrasing is ambiguous and could also mean..." This doesn't prevent all mistakes, but it creates natural breakpoints where you can inject human oversight. High-confidence actions go through automatically. Medium-confidence actions get flagged for review. Low-confidence actions get blocked with an explanation. Layer 4: Observability and auditability Action Validation Pipeline If you can't debug it, you can't trust it. Every decision the agent makes needs to be loggable, traceable, and explainable. Not just "what action did it take" but "what was it thinking, what data did it consider, what was the reasoning chain?" We've built a custom logging system that captures the full large language model (LLM) interaction — the prompt, the response, the context window, even the model temperature settings. It's verbose as hell, but when something goes wrong (and it will), you need to be able to reconstruct exactly what happened. Plus, this becomes your dataset for fine-tuning and improvement. Guardrails: The art of saying no Let's talk about guardrails, because this is where engineering discipline really matters. A lot of teams approach guardrails as an afterthought — "we'll add some safety checks if we need them." That's backwards. Guardrails should be your starting point. We think of guardrails in three categories. Permission boundaries What is the agent physically allowed to do? This is your blast radius control. Even if the agent hallucinates the worst possible action, what's the maximum damage it can cause? We use a principle called "graduated autonomy." New agents start with read-only access. As they prove reliable, they graduate to low-risk writes (creating calendar events, sending internal messages). High-risk actions (financial transactions, external communications, data deletion) either require explicit human approval or are simply off-limits. One technique that's worked well: Action cost budgets. Each agent has a daily "budget" denominated in some unit of risk or cost. Reading a database record costs 1 unit. Sending an email costs 10. Initiating a vendor payment costs 1,000. The agent can operate autonomously until it exhausts its budget; then, it needs human intervention. This creates a natural throttle on potentially problematic behavior. Graduated Autonomy and Action Cost Budget Semantic Houndaries What should the agent understand as in-scope vs out-of-scope? This is trickier because it's conceptual, not just technical. I've found that explicit domain definitions help a lot. Our customer service agent has a clear mandate: handle product questions, process returns, escalate complaints. Anything outside that domain — someone asking for investment advice, technical support for third-party products, personal favors — gets a polite deflection and escalation. The challenge is making these boundaries robust to prompt injection and jailbreaking attempts. Users will try to convince the agent to help with out-of-scope requests. Other parts of the system might inadvertently pass instructions that override the agent's boundaries. You need multiple layers of defense here. Operational boundaries How much can the agent do, and how fast? This is your rate limiting and resource control. We've implemented hard limits on everything: API calls per minute, maximum tokens per interaction, maximum cost per day, maximum number of retries before human escalation. These might seem like artificial constraints, but they're essential for preventing runaway behavior. We once saw an agent get stuck in a loop trying to resolve a scheduling conflict. It kept proposing times, getting rejections, and trying again. Without rate limits, it sent 300 calendar invites in an hour. With proper operational boundaries, it would've hit a threshold and escalated to a human after attempt number 5. Agents need their own style of testing Traditional software testing doesn't cut it for autonomous agents. You can't just write test cases that cover all the edge cases, because with LLMs, everything is an edge case. What's worked for us: Simulation environments Build a sandbox that mirrors production but with fake data and mock services. Let the agent run wild. See what breaks. We do this continuously — every code change goes through 100 simulated scenarios before it touches production. The key is making scenarios realistic. Don't just test happy paths. Simulate angry customers, ambiguous requests, contradictory information, system outages. Throw in some adversarial examples. If your agent can't handle a test environment where things go wrong, it definitely can't handle production. Red teaming Get creative people to try to break your agent. Not just security researchers, but domain experts who understand the business logic. Some of our best improvements came from sales team members who tried to "trick" the agent into doing things it shouldn't. Shadow mode Before you go live, run the agent in shadow mode alongside humans. The agent makes decisions, but humans actually execute the actions. You log both the agent's choices and the human's choices, and you analyze the delta. This is painful and slow, but it's worth it. You'll find all kinds of subtle misalignments you'd never catch in testing. Maybe the agent technically gets the right answer, but with phrasing that violates company tone guidelines. Maybe it makes legally correct but ethically questionable decisions. Shadow mode surfaces these issues before they become real problems. The human-in-the-loop pattern Three Human-in-the-Loop Patterns Despite all the automation, humans remain essential. The question is: Where in the loop? We're increasingly convinced that "human-in-the-loop" is actually several distinct patterns: Human-on-the-loop: The agent operates autonomously, but humans monitor dashboards and can intervene. This is your steady-state for well-understood, low-risk operations. Human-in-the-loop: The agent proposes actions, humans approve them. This is your training wheels mode while the agent proves itself, and your permanent mode for high-risk operations. Human-with-the-loop: Agent and human collaborate in real-time, each handling the parts they're better at. The agent does the grunt work, the human does the judgment calls. The trick is making these transitions smooth. An agent shouldn't feel like a completely different system when you move from autonomous to supervised mode. Interfaces, logging, and escalation paths should all be consistent. Failure modes and recovery Let's be honest: Your agent will fail. The question is whether it fails gracefully or catastrophically. We classify failures into three categories: Recoverable errors: The agent tries to do something, it doesn't work, the agent realizes it didn't work and tries something else. This is fine. This is how complex systems operate. As long as the agent isn't making things worse, let it retry with exponential backoff. Detectable failures: The agent does something wrong, but monitoring systems catch it before significant damage occurs. This is where your guardrails and observability pay off. The agent gets rolled back, humans investigate, you patch the issue. Undetectable failures: The agent does something wrong, and nobody notices until much later. These are the scary ones. Maybe it's been misinterpreting customer requests for weeks. Maybe it's been making subtly incorrect data entries. These accumulate into systemic issues. The defense against undetectable failures is regular auditing. We randomly sample agent actions and have humans review them. Not just pass/fail, but detailed analysis. Is the agent showing any drift in behavior? Are there patterns in its mistakes? Is it developing any concerning tendencies? The cost-performance tradeoff Here's something nobody talks about enough: reliability is expensive. Every guardrail adds latency. Every validation step costs compute. Multiple model calls for confidence checking multiply your API costs. Comprehensive logging generates massive data volumes. You have to be strategic about where you invest. Not every agent needs the same level of reliability. A marketing copy generator can be looser than a financial transaction processor. A scheduling assistant can retry more liberally than a code deployment system. We use a risk-based approach. High-risk agents get all the safeguards, multiple validation layers, extensive monitoring. Lower-risk agents get lighter-weight protections. The key is being explicit about these trade-offs and documenting why each agent has the guardrails it does. Organizational challenges We'd be remiss if we didn't mention that the hardest parts aren't technical — they're organizational. Who owns the agent when it makes a mistake? Is it the engineering team that built it? The business unit that deployed it? The person who was supposed to be supervising it? How do you handle edge cases where the agent's logic is technically correct but contextually inappropriate? If the agent follows its rules but violates an unwritten norm, who's at fault? What's your incident response process when an agent goes rogue? Traditional runbooks assume human operators making mistakes. How do you adapt these for autonomous systems? These questions don't have universal answers, but they need to be addressed before you deploy. Clear ownership, documented escalation paths, and well-defined success metrics are just as important as the technical architecture. Where we go from here The industry is still figuring this out. There's no established playbook for building reliable autonomous agents. We're all learning in production, and that's both exciting and terrifying. What we know for sure: The teams that succeed will be the ones who treat this as an engineering discipline, not just an AI problem. You need traditional software engineering rigor — testing, monitoring, incident response — combined with new techniques specific to probabilistic systems. You need to be paranoid but not paralyzed. Yes, autonomous agents can fail in spectacular ways. But with proper guardrails, they can also handle enormous workloads with superhuman consistency. The key is respecting the risks while embracing the possibilities. We'll leave you with this: Every time we deploy a new autonomous capability, we run a pre-mortem. We imagine it's six months from now and the agent has caused a significant incident. What happened? What warning signs did we miss? What guardrails failed? This exercise has saved us more times than we can count. It forces you to think through failure modes before they occur, to build defenses before you need them, to question assumptions before they bite you. Because in the end, building enterprise-grade autonomous AI agents isn't about making systems that work perfectly. It's about making systems that fail safely, recover gracefully, and learn continuously. And that's the kind of engineering that actually matters. Madhvesh Kumar is a principal engineer. Deepika Singh is a senior software engineer. Views expressed are based on hands-on experience building and deploying autonomous agents, along with the occasional 3 AM incident response that makes you question your career choices.