Testing autonomous agents (Or: how I learned to stop worrying and embrace chaos)
Our take

Look, we've spent the last 18 months building production AI systems, and we'll tell you what keeps us up at night — and it's not whether the model can answer questions. That's table stakes now. What haunts us is the mental image of an agent autonomously approving a six-figure vendor contract at 2 a.m. because someone typo'd a config file.
We've moved past the era of "ChatGPT wrappers" (thank God), but the industry still treats autonomous agents like they're just chatbots with API access. They're not. When you give an AI system the ability to take actions without human confirmation, you're crossing a fundamental threshold. You're not building a helpful assistant anymore — you're building something closer to an employee. And that changes everything about how we need to engineer these systems.
The autonomy problem nobody talks about
Here's what's wild: We've gotten really good at making models that *sound* confident. But confidence and reliability aren't the same thing, and the gap between them is where production systems go to die.
We learned this the hard way during a pilot program where we let an AI agent manage calendar scheduling across executive teams. Seems simple, right? The agent could check availability, send invites, handle conflicts. Except, one Monday morning, it rescheduled a board meeting because it interpreted "let's push this if we need to" in a Slack message as an actual directive. The model wasn't wrong in its interpretation — it was plausible. But plausible isn't good enough when you're dealing with autonomy.
That incident taught us something crucial: The challenge isn't building agents that work most of the time. It's building agents that fail gracefully, know their limitations, and have the circuit breakers to prevent catastrophic mistakes.
What reliability actually means for autonomous systems
[Figure: Layered reliability architecture]
When we talk about reliability in traditional software engineering, we've got decades of patterns: Redundancy, retries, idempotency, graceful degradation. But AI agents break a lot of our assumptions.
Traditional software fails in predictable ways. You can write unit tests. You can trace execution paths. With AI agents, you're dealing with probabilistic systems making judgment calls. A bug isn't just a logic error; it's the model hallucinating a plausible-sounding but completely fabricated API endpoint, or misreading context in a way that technically parses but misses the human intent.
So what does reliability look like here? In our experience, it's a layered approach.
Layer 1: Model selection and prompt engineering
This is foundational but insufficient. Yes, use the best model you can afford. Yes, craft your prompts carefully with examples and constraints. But don't fool yourself into thinking that a great prompt is enough. We've seen too many teams ship "GPT-4 with a really good system prompt" and call it enterprise-ready.
Layer 2: Deterministic guardrails
Before the model does anything irreversible, run it through hard checks. Is it trying to access a resource it shouldn't? Is the action within acceptable parameters? We're talking old-school validation logic — regex, schema validation, allowlists. It's not sexy, but it's effective.
One pattern that's worked well for us: Maintain a formal action schema. Every action an agent can take has a defined structure, required fields, and validation rules. The agent proposes actions in this schema, and we validate before execution. If validation fails, we don't just block it — we feed the validation errors back to the agent and let it try again with context about what went wrong.
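The propose, validate, and retry loop described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the action names, the `ACTION_SCHEMAS` table, and the `propose` callback (standing in for a call to the agent) are all hypothetical.

```python
# Hypothetical schema table: action name -> required fields and expected types.
ACTION_SCHEMAS = {
    "create_calendar_event": {"title": str, "start": str, "duration_minutes": int},
    "send_internal_message": {"recipient": str, "body": str},
}


def validate_action(action: dict) -> list[str]:
    """Return validation errors; an empty list means the action is valid."""
    name = action.get("name")
    schema = ACTION_SCHEMAS.get(name)
    if schema is None:
        return [f"unknown action: {name!r}"]
    params = action.get("params", {})
    errors = []
    for field, expected in schema.items():
        if field not in params:
            errors.append(f"missing field: {field}")
        elif not isinstance(params[field], expected):
            errors.append(f"field {field} must be {expected.__name__}")
    for field in params:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


def propose_until_valid(propose, max_attempts: int = 3):
    """Ask the agent for an action, feeding errors back until it validates.

    `propose` is a stand-in for the agent call: it receives the previous
    validation errors and returns a new proposed action as a dict.
    """
    feedback: list[str] = []
    for _ in range(max_attempts):
        action = propose(feedback)
        feedback = validate_action(action)
        if not feedback:
            return action
    return None  # still invalid after max_attempts; the caller should escalate
```

Capping the number of attempts matters here: without it, an agent that never produces a valid action would retry forever.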
Layer 3: Confidence and uncertainty quantification
Here's where it gets interesting. We need agents that know what they don't know. We've been experimenting with agents that can explicitly reason about their confidence before taking actions. Not just a probability score, but actual articulated uncertainty: "I'm interpreting this email as a request to delay the project, but the phrasing is ambiguous and could also mean..."
This doesn't prevent all mistakes, but it creates natural breakpoints where you can inject human oversight. High-confidence actions go through automatically. Medium-confidence actions get flagged for review. Low-confidence actions get blocked with an explanation.
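A sketch of that three-way routing, assuming the agent reports a numeric confidence between 0 and 1 alongside its articulated rationale. The threshold values and names are hypothetical, and a self-reported score is not guaranteed to be well calibrated, so treat the cutoffs as something to tune against real review data.

```python
from enum import Enum


class Route(Enum):
    AUTO_EXECUTE = "auto_execute"
    HUMAN_REVIEW = "human_review"
    BLOCK = "block"


# Hypothetical thresholds; tune per agent and per action type.
HIGH_CONFIDENCE = 0.9
LOW_CONFIDENCE = 0.6


def route_action(confidence: float, rationale: str) -> tuple[Route, str]:
    """Map a confidence score to one of the three handling paths."""
    # Anything outside [0, 1] (including NaN) is treated as untrusted.
    if not 0.0 <= confidence <= 1.0:
        return Route.BLOCK, "confidence out of range; treating as untrusted"
    if confidence >= HIGH_CONFIDENCE:
        return Route.AUTO_EXECUTE, rationale
    if confidence >= LOW_CONFIDENCE:
        return Route.HUMAN_REVIEW, rationale
    return Route.BLOCK, f"blocked (low confidence): {rationale}"
```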
Layer 4: Observability and auditability
[Figure: Action validation pipeline]
If you can't debug it, you can't trust it. Every decision the agent makes needs to be loggable, traceable, and explainable. Not just "what action did it take" but "what was it thinking, what data did it consider, what was the reasoning chain?"
We've built a custom logging system that captures the full large language model (LLM) interaction — the prompt, the response, the context window, even the model temperature settings. It's verbose as hell, but when something goes wrong (and it will), you need to be able to reconstruct exactly what happened. Plus, this becomes your dataset for fine-tuning and improvement.
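To give a sense of what one captured interaction might look like, here is a minimal sketch of a structured log record. The field names and the `LLMCallRecord` type are illustrative assumptions, not the actual schema of our logging system.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class LLMCallRecord:
    """One LLM interaction, captured with enough detail to replay it later."""
    prompt: str
    response: str
    context: list[str]      # documents or messages placed in the context window
    temperature: float
    model: str              # hypothetical identifier for the model version used
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def to_log_line(record: LLMCallRecord) -> str:
    """Serialize a record as one JSON line so it can be shipped to any log store."""
    return json.dumps(asdict(record), sort_keys=True)
```

A shared `trace_id` across all the calls in one agent run is what makes it possible to reconstruct the full reasoning chain afterward.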
Guardrails: The art of saying no
Let's talk about guardrails, because this is where engineering discipline really matters. A lot of teams approach guardrails as an afterthought — "we'll add some safety checks if we need them." That's backwards. Guardrails should be your starting point.
We think of guardrails in three categories.
Permission boundaries
What is the agent physically allowed to do? This is your blast radius control. Even if the agent hallucinates the worst possible action, what's the maximum damage it can cause?
We use a principle called "graduated autonomy." New agents start with read-only access. As they prove reliable, they graduate to low-risk writes (creating calendar events, sending internal messages). High-risk actions (financial transactions, external communications, data deletion) either require explicit human approval or are simply off-limits.
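One way to express graduated autonomy in code is a permission check that sits outside the model. The sketch below is illustrative only: the tier names, the action names, and the `ACTION_RISK` table are assumptions made for the example.

```python
from enum import IntEnum


class Tier(IntEnum):
    READ_ONLY = 0        # where every new agent starts
    LOW_RISK_WRITE = 1   # earned after a track record of reliable behavior


# Hypothetical mapping from action name to risk class.
ACTION_RISK = {
    "read_calendar": "read",
    "create_calendar_event": "low",
    "send_internal_message": "low",
    "vendor_payment": "high",
    "delete_data": "high",
}


def decide(agent_tier: Tier, action: str) -> str:
    """Return 'allow', 'deny', or 'needs_human_approval' for a proposed action."""
    risk = ACTION_RISK.get(action)
    if risk is None:
        return "deny"  # unlisted actions are off-limits by default
    if risk == "high":
        return "needs_human_approval"  # never autonomous, whatever the tier
    if risk == "low" and agent_tier < Tier.LOW_RISK_WRITE:
        return "deny"
    return "allow"
```

Because the check is deterministic, the blast radius stays bounded even if the model proposes the worst possible action.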
One technique that's worked well: Action cost budgets. Each agent has a daily "budget" denominated in some unit of risk or cost. Reading a database record costs 1 unit. Sending an email costs 10. Initiating a vendor payment costs 1,000. The agent can operate autonomously until it exhausts its budget; then, it needs human intervention. This creates a natural throttle on potentially problematic behavior.
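A minimal sketch of such a budget, using the unit costs from the paragraph above. The class name, the action names, and the daily limit are hypothetical.

```python
class BudgetExhausted(Exception):
    """Raised when an action would push the agent past its daily budget."""


# Unit costs from the text: read = 1, email = 10, vendor payment = 1,000.
ACTION_COSTS = {"read_record": 1, "send_email": 10, "vendor_payment": 1000}


class ActionBudget:
    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.spent = 0

    def charge(self, action: str) -> int:
        """Charge for an action and return the remaining budget.

        Raises KeyError for unlisted actions (deny by default) and
        BudgetExhausted when the action would exceed the daily limit.
        """
        cost = ACTION_COSTS[action]
        if self.spent + cost > self.daily_limit:
            raise BudgetExhausted(
                f"{action} costs {cost}, only {self.daily_limit - self.spent} left"
            )
        self.spent += cost
        return self.daily_limit - self.spent

    def reset(self) -> None:
        """Call once per day, for example from a scheduled job."""
        self.spent = 0
```

With a daily limit of, say, 500, an agent could read records and send a few dozen emails on its own, but a single vendor payment would always need a human.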
[Figure: Graduated autonomy and action cost budget]
Semantic boundaries
What should the agent understand as in-scope vs out-of-scope? This is trickier because it's conceptual, not just technical.
We've found that explicit domain definitions help a lot. Our customer service agent has a clear mandate: handle product questions, process returns, escalate complaints. Anything outside that domain (someone asking for investment advice, technical support for third-party products, personal favors) gets a polite deflection and escalation.
The challenge is making these boundaries robust to prompt injection and jailbreaking attempts. Users will try to convince the agent to help with out-of-scope requests. Other parts of the system might inadvertently pass instructions that override the agent's boundaries. You need multiple layers of defense here.
Operational boundaries
How much can the agent do, and how fast? This is your rate limiting and resource control.
We've implemented hard limits on everything: API calls per minute, maximum tokens per interaction, maximum cost per day, maximum number of retries before human escalation. These might seem like artificial constraints, but they're essential for preventing runaway behavior.
We once saw an agent get stuck in a loop trying to resolve a scheduling conflict. It kept proposing times, getting rejections, and trying again. Without rate limits, it sent 300 calendar invites in an hour. With proper operational boundaries, it would've hit a threshold and escalated to a human after attempt number 5.
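Two of those limits, a sliding-window rate limit and a retry cap with escalation, might look something like this. It is a sketch under assumptions: the class and function names are made up, and `try_once` stands in for one attempt by the agent. The cap of 5 mirrors the threshold mentioned above.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter: at most max_calls within window_seconds."""

    def __init__(self, max_calls: int, window_seconds: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.calls = deque()

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True


MAX_ATTEMPTS = 5


def resolve_with_escalation(try_once, max_attempts: int = MAX_ATTEMPTS):
    """Call try_once(attempt) until it returns a result or attempts run out.

    Returns (status, result, attempts_used), where status is "resolved"
    or "escalated". A None result counts as a failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        result = try_once(attempt)
        if result is not None:
            return ("resolved", result, attempt)
    return ("escalated", None, max_attempts)
```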
Agents need their own style of testing
Traditional software testing doesn't cut it for autonomous agents. You can't just write test cases that cover all the edge cases, because with LLMs, everything is an edge case.
What's worked for us:
Simulation environments
Build a sandbox that mirrors production but with fake data and mock services. Let the agent run wild. See what breaks. We do this continuously — every code change goes through 100 simulated scenarios before it touches production.
The key is making scenarios realistic. Don't just test happy paths. Simulate angry customers, ambiguous requests, contradictory information, system outages. Throw in some adversarial examples. If your agent can't handle a test environment where things go wrong, it definitely can't handle production.
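A scenario harness for this kind of sandbox testing can be quite small. In the sketch below, `agent` is any callable that takes user input and returns the action it chose; the `Scenario` type and its fields are hypothetical, and a real harness would also wire in the mock services.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    name: str
    user_input: str
    check: Callable[[str], bool]  # True if the agent's chosen action is acceptable


def run_scenarios(agent: Callable[[str], str], scenarios: list[Scenario]):
    """Run every scenario and return (name, observed_action) for each failure."""
    failures = []
    for scenario in scenarios:
        try:
            action = agent(scenario.user_input)
            ok = scenario.check(action)
        except Exception as exc:  # a crash in the sandbox counts as a failure
            action = f"exception: {exc}"
            ok = False
        if not ok:
            failures.append((scenario.name, action))
    return failures
```

Gating a deploy on `run_scenarios` returning an empty list is one simple way to make the simulated run part of every code change.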
Red teaming
Get creative people to try to break your agent. Not just security researchers, but domain experts who understand the business logic. Some of our best improvements came from sales team members who tried to "trick" the agent into doing things it shouldn't.
Shadow mode
Before you go live, run the agent in shadow mode alongside humans. The agent makes decisions, but humans actually execute the actions. You log both the agent's choices and the human's choices, and you analyze the delta.
This is painful and slow, but it's worth it. You'll find all kinds of subtle misalignments you'd never catch in testing. Maybe the agent technically gets the right answer, but with phrasing that violates company tone guidelines. Maybe it makes legally correct but ethically questionable decisions. Shadow mode surfaces these issues before they become real problems.
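Analyzing the delta can start with something as simple as the sketch below, which assumes each case is logged as a `(case_id, agent_choice, human_choice)` tuple. Exact-match comparison is an assumption that only works for discrete actions; catching tone or ethics issues like the ones above still requires a human reading the disagreements.

```python
def shadow_report(pairs):
    """Summarize agreement between agent and human decisions.

    pairs: list of (case_id, agent_choice, human_choice) tuples.
    Returns the agreement rate and the list of disagreements for review.
    """
    disagreements = [(cid, agent, human) for cid, agent, human in pairs if agent != human]
    rate = 1 - len(disagreements) / len(pairs) if pairs else 0.0
    return {"agreement_rate": rate, "disagreements": disagreements}
```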
The human-in-the-loop pattern
[Figure: Three human-in-the-loop patterns]
Despite all the automation, humans remain essential. The question is: Where in the loop?
We're increasingly convinced that "human-in-the-loop" is actually several distinct patterns:
Human-on-the-loop: The agent operates autonomously, but humans monitor dashboards and can intervene. This is your steady-state for well-understood, low-risk operations.
Human-in-the-loop: The agent proposes actions, and humans approve them. This is your training-wheels mode while the agent proves itself, and your permanent mode for high-risk operations.
Human-with-the-loop: Agent and human collaborate in real time, each handling the parts they're better at. The agent does the grunt work; the human makes the judgment calls.
The trick is making these transitions smooth. An agent shouldn't feel like a completely different system when you move from autonomous to supervised mode. Interfaces, logging, and escalation paths should all be consistent.
Failure modes and recovery
Let's be honest: Your agent will fail. The question is whether it fails gracefully or catastrophically.
We classify failures into three categories:
Recoverable errors: The agent tries to do something, it doesn't work, the agent realizes it didn't work and tries something else. This is fine. This is how complex systems operate. As long as the agent isn't making things worse, let it retry with exponential backoff.
Detectable failures: The agent does something wrong, but monitoring systems catch it before significant damage occurs. This is where your guardrails and observability pay off. The agent gets rolled back, humans investigate, you patch the issue.
Undetectable failures: The agent does something wrong, and nobody notices until much later. These are the scary ones. Maybe it's been misinterpreting customer requests for weeks. Maybe it's been making subtly incorrect data entries. These accumulate into systemic issues.
The defense against undetectable failures is regular auditing. We randomly sample agent actions and have humans review them. Not just pass/fail, but detailed analysis. Is the agent showing any drift in behavior? Are there patterns in its mistakes? Is it developing any concerning tendencies?
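Random sampling for review is easy to get subtly wrong (for instance, by always sampling the first N actions of the day), so here is a minimal sketch. The function name and the sampling rate are assumptions; the seed is optional and only there to make a given audit batch reproducible.

```python
import random


def sample_for_audit(action_ids, rate: float, seed=None):
    """Pick a uniform random subset of agent actions for human review.

    rate is the fraction to review, in (0, 1]. At least one action is
    sampled whenever any exist, so quiet days still get audited.
    """
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    items = list(action_ids)
    if not items:
        return []
    k = min(len(items), max(1, round(len(items) * rate)))
    return random.Random(seed).sample(items, k)
```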
The cost-performance tradeoff
Here's something nobody talks about enough: reliability is expensive.
Every guardrail adds latency. Every validation step costs compute. Multiple model calls for confidence checking multiply your API costs. Comprehensive logging generates massive data volumes.
You have to be strategic about where you invest. Not every agent needs the same level of reliability. A marketing copy generator can be looser than a financial transaction processor. A scheduling assistant can retry more liberally than a code deployment system.
We use a risk-based approach. High-risk agents get all the safeguards, multiple validation layers, extensive monitoring. Lower-risk agents get lighter-weight protections. The key is being explicit about these trade-offs and documenting why each agent has the guardrails it does.
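One way to make those trade-offs explicit is to keep each agent's guardrail profile, along with the reason for it, in version-controlled code or config. Everything in this sketch is hypothetical: the profile fields, the agent names, and the specific numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailProfile:
    validation_layers: int
    confidence_check: bool
    human_approval: bool
    audit_sample_rate: float
    rationale: str  # the documented "why" behind this level of protection


STRICTEST = GuardrailProfile(
    3, True, True, 0.25, "Mistakes move money and are hard to reverse."
)

PROFILES = {
    "marketing_copy": GuardrailProfile(
        1, False, False, 0.01, "Output is reviewed before publishing; errors are cheap to fix."
    ),
    "financial_transactions": STRICTEST,
}


def profile_for(agent_name: str) -> GuardrailProfile:
    """Unknown agents get the strictest profile rather than the loosest."""
    return PROFILES.get(agent_name, STRICTEST)
```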
Organizational challenges
We'd be remiss if we didn't mention that the hardest parts aren't technical — they're organizational.
Who owns the agent when it makes a mistake? Is it the engineering team that built it? The business unit that deployed it? The person who was supposed to be supervising it?
How do you handle edge cases where the agent's logic is technically correct but contextually inappropriate? If the agent follows its rules but violates an unwritten norm, who's at fault?
What's your incident response process when an agent goes rogue? Traditional runbooks assume human operators making mistakes. How do you adapt these for autonomous systems?
These questions don't have universal answers, but they need to be addressed before you deploy. Clear ownership, documented escalation paths, and well-defined success metrics are just as important as the technical architecture.
Where we go from here
The industry is still figuring this out. There's no established playbook for building reliable autonomous agents. We're all learning in production, and that's both exciting and terrifying.
What we know for sure: The teams that succeed will be the ones who treat this as an engineering discipline, not just an AI problem. You need traditional software engineering rigor — testing, monitoring, incident response — combined with new techniques specific to probabilistic systems.
You need to be paranoid but not paralyzed. Yes, autonomous agents can fail in spectacular ways. But with proper guardrails, they can also handle enormous workloads with superhuman consistency. The key is respecting the risks while embracing the possibilities.
We'll leave you with this: Every time we deploy a new autonomous capability, we run a pre-mortem. We imagine it's six months from now and the agent has caused a significant incident. What happened? What warning signs did we miss? What guardrails failed?
This exercise has saved us more times than we can count. It forces you to think through failure modes before they occur, to build defenses before you need them, to question assumptions before they bite you.
Because in the end, building enterprise-grade autonomous AI agents isn't about making systems that work perfectly. It's about making systems that fail safely, recover gracefully, and learn continuously.
And that's the kind of engineering that actually matters.
Madhvesh Kumar is a principal engineer. Deepika Singh is a senior software engineer.
Views expressed are based on hands-on experience building and deploying autonomous agents, along with the occasional 3 a.m. incident response that makes you question your career choices.
Read on the original site
Open the publisher's page for the full experience
Related Articles
- Designing the agentic AI enterprise for measurable performancePresented by Edgeverve Smart, semi‑autonomous AI agents handling complex, real‑time business work is a compelling vision. But moving from impressive pilots to production‑grade impact requires more than clever prompts or proof‑of‑concept demos. It takes clear goals, data‑driven workflows, and an enterprise platform that balances autonomy, governance, observability, and flexibility with hard guardrails from day one. From pilots to the “operational grey zones” The next wave of value sits in the connective tissue between applications — those operational grey zones where handoffs, reconciliations, approvals, and data lookups still rely on humans. Assigning agents to these paths means collapsing system boundaries, applying intelligence to context, and re‑imagining processes that were never formally automated. Many pilots stall because they start as lab experiments rather than outcome‑anchored designs tied to production systems, controls, and KPIs. Start with outcomes, not algorithms. Translate organizational KPIs (cash‑flow, DSO, SLA adherence, compliance hit rates, MTTR, NPS, claims leakage, etc.) into agent goals, then cascade them into single‑agent and multi‑agent objectives. Only after goals are explicit should you select workflows and decompose tasks. Pick targets, then decompose the work What does “target” actually mean? In agentic programs, a target is a business outcome and the use case that moves it. For example, “reduce unapplied cash by 20%” target outcome; “cash application and exceptions handling” use case. With the use case in hand, perform persona‑level task decomposition: map the human role (e.g., cash applications analyst, facilities coordinator), enumerate their tasks, and identify which are ripe for agentification (data retrieval, matching, policy checks, decision proposals, transaction initiation). 
Delivering on those tasks requires a data‑embedded workflow fabric that can read, write, and reason across enterprise systems while honoring permissions. Data must be AI‑ready, discoverable, governed, labeled where needed, augmented for retrieval (RAG), and policy‑protected for PII, PCI, and regulatory constraints. Integration goes beyond APIs APIs are one mode of integration, not the only one. Robust agent execution typically blends: Stable APIs with lifecycle management for core systems Event‑driven triggers (streams, webhooks, CDC) to react in real time UI/RPA fallbacks where APIs don’t exist Search/RAG connectors for documents and knowledge bases Policy management across tools and actions to enforce entitlements and segregation of duties The north star is integration reliability — built on idempotency, retries, circuit-breakers, and standardized tool schemas — so agents don’t “hallucinate” actions the enterprise can’t verify. A quick example: finance and facilities, in production Inside our organization, we deployed specialized agents in a live CFO environment and in building maintenance. In finance, seven agents interacted with production systems and real accountability structures. Year‑one outcomes included: >3% monthly cash‑flow improvement, 50% productivity gain in affected workflows, 90% faster onboarding, a shift from account‑level handling to function‑level orchestration, and a $32M cash‑flow lift. These results don’t guarantee gains everywhere; they show that designing products can deliver measurable outcomes on a scale. The four design pillars: Autonomy, governance, observability & evals, flexibility 1) Autonomy: right‑size it to the risk Autonomy exists on a spectrum. Early efforts often automate well‑bounded tasks; others pursue research/analysis agents; increasingly, teams target mission‑critical transactional agents (payments, vendor onboarding, pricing changes). 
The rule: match autonomy to risk, and encode the operating mode suggest‑only, propose‑and‑approve, or execute‑with‑rollback per task. 2) Governance: guardrails by design, not as bolt‑ons Unbounded agents create unacceptable risk. Build guardrails into the plan: Policy & permissions: tie tools/actions to identity, scopes, and SoD rules. Human‑in‑the‑loop (HITL): where mission‑critical thresholds are crossed (amount, vendor risk, regulatory exposure). Agent lifecycle management: versioning, change control, regression gates, approval workflows, and sunsetting. Third‑party agent orchestration: vet external agents like vendors, capabilities, scopes, logs, SLAs. Incident and rollback: kill‑switches, safe‑mode, and compensating transactions. This is how you scale innovation safely while protecting brand, compliance, and customers. 3) Observability & evaluations: trust comes from telemetry Production agents need the same rigor as any core platform: Telemetry: capture full execution traces across perception, planning, tool use, action supported by structured logs and replay. Offline evals: cenario tests, red‑teaming, bias and safety checks, cost/performance benchmarks; baseline vs. challenger comparisons. Online evals: shadow mode, A/B, canary releases, guardrail breach alerts, human feedback loops. Explainability & auditability: why was an action taken, which data/tools were used, and who approved. 4) Flexibility: assume volatility, design for swap‑ability Models, tools, and vendors change fast. Treat agentic capability as platform currency: create an environment where teams can evaluate, select, and swap models/tools without tearing down the build. Use a model router, tool registry, and contract‑first interfaces so upgrades are controlled experiments, not rewrites. The agent platform fabric: how platformization turns goals into outcomes A true agentic enterprise requires a platform fabric that transforms goals into outcomes, not a patchwork of isolated pilots. 
This platform anchors enterprise‑to‑agent KPI cascades, drives task decomposition and multi‑agent planning, and provides governed tooling and data access across APIs, RPA, search, and databases. It centralizes knowledge and memory through RAG and vector stores, enforces enterprise controls via a policy engine, and manages performance and safety through a unified model layer. It supports robust orchestration of first‑ and third‑party agents with common context, embeds deep observability and evaluation pipelines, and applies disciplined release engineering from sandbox to GA. Finally, it ensures long‑term resilience through lifecycle management versioning, deprecation, incident playbooks, and auditable histories. Guardrails in action: a BFSI example Consider payments exception handling in banking — high stakes, regulated, and customer‑visible. An agent proposes a resolution (e.g., auto‑reconcile or escalate) only when: The transaction falls below risk thresholds; above them, it triggers HITL approval. All policy checks (KYC/AML, velocity, sanctions) pass. Observability hooks record rationale, tools invoked, and data used. Rollback/compensation is defined if downstream failures occur. This pattern generalizes to vendor onboarding, pricing overrides, or claims adjudication — mission‑critical work with explicit safety rails. Scale beyond pilots Scaling agentic AI beyond pilots demands disciplined readiness across nine fronts: leaders must clarify which KPIs matter and how agent goals ladder into them, determine which persona tasks are agentified versus remain human‑led, and align each with the right autonomy mode from suggest‑only to propose‑and‑approve to execute‑with‑rollback. They must embed governance guardrails, including HITL points and lifecycle controls; ensure robust observability and evaluation via telemetry, replay, audits, and offline/online tests; and verify data readiness, with governed, policy‑protected, retrieval‑augmented data flows. 
Integration must be reliable, with API lifecycle management, event triggers, and RPA/other fallbacks. The underlying platform should enable model swap‑ability and orchestration of first‑ and third‑party agents without rebuilding. Finally, measurement must focus on true operational impact cash flow, cycle times, quality, and risk reduction rather than task counts. The takeaway Agentic AI is not a shortcut; it’s a new system of work. Enterprises that approach it with platform discipline aligning autonomy with risk, embedding governance and observability, and designing for swap‑ability will convert pilots into production impact. Those that don’t keep accumulating impressive but disconnected demos. The difference isn’t how fast you ship an agent; it’s how deliberately you design the enterprise around it. N. Shashidar is SVP & Global Head, Product Management at EdgeVerve. Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
- Context decay, orchestration drift, and the rise of silent failures in AI systemsThe most expensive AI failure I have seen in enterprise deployments did not produce an error. No alert fired. No dashboard turned red. The system was fully operational, it was just consistently, confidently wrong. That is the reliability gap. And it is the problem most enterprise AI programs are not built to catch. We have spent the last two years getting very good at evaluating models: benchmarks, accuracy scores, red-team exercises, retrieval quality tests. But in production, the model is rarely where the system breaks. It breaks in the infrastructure layer, the data pipelines feeding it, the orchestration logic wrapping it, the retrieval systems grounding it, the downstream workflows trusting its output. That layer is still being monitored with tools designed for a different kind of software. The gap no one is measuring Here's what makes this problem hard to see: Operationally healthy and behaviorally reliable are not the same thing, and most monitoring stacks cannot tell the difference. A system can show green across every infrastructure metric, latency within SLA, throughput normal, error rate flat, while simultaneously reasoning over retrieval results that are six months stale, silently falling back to cached context after a tool call degrades, or propagating a misinterpretation through five steps of an agentic workflow. None of that shows up in Prometheus. None of it trips a Datadog alert. The reason is straightforward: Traditional observability was built to answer the question “is the service up?” Enterprise AI requires answering a harder question: “Is the service behaving correctly?” Those are different instruments. 
What teams typically measure What actually drives AI infrastructure failure Uptime / latency / error rate Retrieval freshness and grounding confidence Token usage Context integrity across multi-step workflows Throughput Semantic drift under real-world load Model benchmark scores Behavioral consistency when conditions degrade Infrastructure error rate Silent partial failure at the reasoning layer Closing this gap requires adding a behavioral telemetry layer alongside the infrastructure one — not replacing what exists, but extending it to capture what the model actually did with the context it received, not just whether the service responded. Four failure patterns that standard monitoring will not catch Across enterprise AI deployments in network operations, logistics, and observability platforms, I see four failure patterns repeat with enough consistency to name them. The first is context degradation. The model reasons over incomplete or stale data in a way that is invisible to the end user. The answer looks polished. The grounding is gone. Detection usually happens weeks later, through downstream consequences rather than system alerts. The second is orchestration drift. Agentic pipelines rarely fail because one component breaks. They fail because the sequence of interactions between retrieval, inference, tool use, and downstream action starts to diverge under real-world load. A system that looked stable in testing behaves very differently when latency compounds across steps and edge cases stack. The third is a silent partial failure. One component underperforms without crossing an alert threshold. The system degrades behaviorally before it degrades operationally. These failures accumulate quietly and surface first as user mistrust, not incident tickets. By the time the signal reaches a postmortem, the erosion has been happening for weeks. The fourth is the automation blast radius. In traditional software, a localized defect stays local. 
In AI-driven workflows, one misinterpretation early in the chain can propagate across steps, systems, and business decisions. The cost is not just technical. It becomes organizational, and it is very hard to reverse. Metrics tell you what happened. They rarely tell you what almost happened. Why classic chaos engineering is not enough and what needs to change Traditional chaos engineering asks the right kind of question: What happens when things break? Kill a node. Drop a partition. Spike CPU. Observe. Those tests are necessary, and enterprises should run them. But for AI systems, the most dangerous failures are not caused by hard infrastructure faults. They emerge at the interaction layer between data quality, context assembly, model reasoning, orchestration logic, and downstream action. You can stress the infrastructure all day and never surface the failure mode that costs you the most. What AI reliability testing needs is an intent-based layer: Define what the system must do under degraded conditions, not just what it should do when everything works. Then test the specific conditions that challenge that intent. What happens if the retrieval layer returns content that is technically valid but six months outdated? What happens if a summarization agent loses 30% of its context window to unexpected token inflation upstream? What happens if a tool call succeeds syntactically but returns semantically incomplete data? What happens if an agent retries through a degraded workflow and compounds its own error with each step? These scenarios are not edge cases. They are what production looks like. This is the framework I have applied in building reliability systems for enterprise infrastructure: Intent-based chaos level creation for distributed computing environments. The key insight: Intent defines the test, not just the fault. What the infrastructure layer actually needs None of this requires reinventing the stack. It requires extending four things. 
Add behavioral telemetry alongside infrastructure telemetry. Track whether responses were grounded, whether fallback behavior was triggered, whether confidence dropped below a meaningful threshold, whether the output was appropriate for the downstream context it entered. This is the observability layer that makes everything else interpretable. Introduce semantic fault injection into pre-production environments. Deliberately simulate stale retrieval, incomplete context assembly, tool-call degradation, and token-boundary pressure. The goal is not theatrical chaos. The goal is finding out how the system behaves when conditions are slightly worse than your staging environment — which is always what production is. Define safe halt conditions before deployment, not after the first incident. AI systems need the equivalent of circuit breakers at the reasoning layer. If a system cannot maintain grounding, validate context integrity, or complete a workflow with enough confidence to be trusted, it should stop cleanly, label the failure, and hand control to a human or a deterministic fallback. A graceful halt is almost always safer than a fluent error. Too many systems are designed to keep going because confident output creates the illusion of correctness. Assign shared ownership for end-to-end reliability. The most common organizational failure is a clean separation between model teams, platform teams, data teams, and application teams. When the system is operationally up but behaviorally wrong, no one owns it clearly. Semantic failure needs an owner. Without one, it accumulates. The maturity curve is shifting For the last two years, the enterprise AI differentiator has been adoption — who gets to production fastest. That phase is ending. As models commoditize and baseline capability converges, competitive advantage will come from something harder to copy: The ability to operate AI reliably at scale, in real conditions, with real consequences. 
Yesterday’s differentiator was model adoption. Today’s is system integration. Tomorrow’s will be reliability under production stress. The enterprises that get there first will not have the most advanced models. They will have the most disciplined infrastructure around them — infrastructure that was tested against the conditions it would actually face, not the conditions that made the pilot look good. The model is not the whole risk. The untested system around it is. Sayali Patil is an AI infrastructure and product leader.
AI agents are running hospital records and factory inspections. Enterprise IAM was never built for them.
A doctor in a hospital exam room watches as a medical transcription agent updates electronic health records, prompts prescription options, and surfaces patient history in real time. A computer vision agent on a manufacturing line is running quality control at speeds no human inspector can match. Both generate non-human identities that most enterprises cannot inventory, scope, or revoke at machine speed. That is the structural problem keeping agentic AI stuck in pilots. Not model capability. Not compute. Identity governance.
Cisco President Jeetu Patel told VentureBeat at RSAC 2026 that 85% of enterprises are running agent pilots while only 5% have reached production. That 80-point gap is a trust problem. The first questions any CISO will ask: which agents have production access to sensitive systems, and who is accountable when one acts outside its scope? IANS Research found that most businesses still lack role-based access control mature enough for today's human identities, and agents will make it significantly harder. The 2026 IBM X-Force Threat Intelligence Index reported a 44% increase in attacks exploiting public-facing applications, driven by missing authentication controls and AI-enabled vulnerability discovery.
Why the trust gap is architectural, not just a tooling problem
Michael Dickman, SVP and GM of Cisco's Campus Networking business, laid out a trust framework in an exclusive interview with VentureBeat that security and networking leaders rarely hear stated this plainly. Before Cisco, Dickman served as Chief Product Officer at Gigamon and SVP of Product Management at Aruba Networks.
Dickman said that the network sees what other telemetry sources miss: actual system-to-system communications rather than inferred activity. "It's that difference of knowing versus guessing," he said. "What the network can see are actual data communications … not, I think this system needs to talk to that system, but which systems are actually talking together." That raw behavioral data, he added, becomes the foundation for cross-domain correlation, and without it, organizations have no reliable way to enforce agent policy at what he called "machine speed."
The trust prerequisite that most AI strategies skip
Dickman argues that agentic AI breaks a pattern he says defined every prior technology transition: deploy for productivity first, bolt on security later. "I don't think trust is one of those things where the business productivity comes first, and the security is an afterthought," Dickman told VentureBeat. "Trust actually is one of the key requirements. Just table stakes from the beginning."
Observing data and recommending decisions carries consequences that stay contained. Execution changes everything. When agents autonomously update patient records, adjust network configurations, or process financial transactions, the blast radius of a compromised identity expands dramatically. "Now more than ever, it's that question of who has the right to do what," Dickman said. "The who is now much more complicated because you have the potential in our reality of these autonomous agents."
Dickman breaks the trust problem into four conditions. The first is secure delegation, which starts by defining what an agent is permitted to do and maintaining a clear chain of human accountability.
The second is cultural readiness; he pointed to alert fatigue as a case study. The traditional fix, Dickman noted, was to aggregate alerts, so analysts see fewer items. With agents capable of evaluating every alert, that logic changes entirely. "It is now possible for an agent to go through all alerts," Dickman said. "You can actually start to think about different workflows in a different way. And then how does that affect the culture of the work, which is amazing."
The third is token economics: Every agent action carries a real computational cost. Dickman sees hybrid architectures as the answer, where agentic AI handles reasoning while traditional deterministic tools execute actions.
The fourth is human judgment. For example, his team used an AI tool to draft a product requirements document. The agent produced 60 pages of repetitive filler that immediately demonstrated how technically responsive the architecture was, yet clearly needed extensive fine-tuning to make the output relevant. "There's no substitute for the human judgment and the talent that's needed to be dextrous with AI," he said.
What the network sees that endpoints miss
Most enterprise data today is proprietary, internal, and fragmented across observability tools, application platforms, and security stacks. Each domain team builds its own view. None sees the full picture.
"It's that difference of knowing versus guessing," Dickman said. "What the network can see are actual data communications. Not 'I think this system needs to talk to that system,' but which systems are actually talking together."
That telemetry grows more valuable as IoT and physical AI proliferate. Computer vision agents analyzing shopper behavior and running factory-floor quality control generate highly sensitive data that demands precise access controls. "All of those things require that trust that we started with, because this is highly sensitive data around like who's doing what in the shop or what's happening on the factory floor," Dickman said.
Why siloed agent data misses the signal
"It's not only aggregation, but actually the creation of knowledge from the network," Dickman said. "There are these new insights you can get when you see the real data communications. And so now it becomes what do we do first versus second versus third?"
That last question reveals where Dickman’s focus lands: the strategic challenge is sequencing, not capability.
"The real power comes from the cross-domain views. The real power comes from correlation," Dickman said. "Versus just aggregation and deduplication of alerts, which is good, but it's a little bit basic."
This is where he sees the most common pitfall. Team A builds Agent A on top of Data A. Team B builds Agent B on top of Data B. Each silo produces incrementally useful automation. The cross-domain insight never materializes.
Independent practitioners validate the pattern. Kayne McGladrey, an IEEE senior member, told VentureBeat that organizations are defaulting to cloning human user profiles for agents, and permission sprawl starts on day one. Carter Rees, VP of AI at Reputation, identified the structural reason. "A significant vulnerability in enterprise AI is broken access control, where the flat authorization plane of an LLM fails to respect user permissions," Rees told VentureBeat. Etay Maor, VP of Threat Intelligence at Cato Networks, reached the same conclusion from the adversarial side. "We need an HR view of agents," Maor told VentureBeat at RSAC 2026. "Onboarding, monitoring, offboarding."
Agentic AI trust gap assessment
Use this matrix to evaluate any platform or combination of platforms against the five trust gaps Dickman identified. Note that the enforcement approaches in the "What network-layer enforcement changes" column reflect Cisco's framework.
| Trust gap | Current control failure | What network-layer enforcement changes | Recommended action |
| --- | --- | --- | --- |
| Agent identity governance | IAM built for human users cannot inventory, scope, or revoke agent identities at machine speed | Agentic IAM registers each agent with defined permissions, an accountable human owner, and a policy-governed access scope | Audit every agent identity in production. Assign a human owner. Define permitted actions before expanding the scope |
| Blast radius containment | Host-based agents and perimeter controls can be bypassed; flat segments give compromised agents lateral movement | Microsegmentation enforces least-privileged access at the network layer, limiting blast radius independent of host-level controls | Implement microsegmentation for every agent-accessible system. Start with the highest-sensitivity data (PHI, financial records) |
| Cross-domain visibility | Siloed observability tools create fragmented views; Team A's agent data never correlates with Team B's security telemetry | Network telemetry captures actual system-to-system communications, feeding a unified data fabric for cross-domain correlation | Unify network, security, and application telemetry into a shared data fabric before deploying production agents |
| Governance-to-enforcement pipeline | No formal process connecting business intent to agent policy to network enforcement | Policy-to-enforcement pipeline translates governance decisions into machine-speed network rules | Establish a formal pipeline from business-intent definition to automated network policy enforcement |
| Cultural and workflow readiness | Organizations automate existing workflows rather than redesigning for agent-scale processing | Network-generated behavioral data reveals actual usage patterns, informing workflow redesign | Run a 30-day telemetry capture before designing agent workflows. Build around observed data, not assumptions |
A broken ankle and a microsegmentation lesson
Dickman grounded his framework in a scenario from his own life. A family member recently broke an ankle, which put him in a hospital exam room watching a medical transcription agent update the EHR, prompt prescription options, and surface patient history in real time. The doctor approved each decision, but the agent handled tasks that previously required manual entry across multiple systems.
The security implications hit differently when it is a loved one's records on the screen.
"I would call it do governance slowly. But do the enforcement and implementation rapidly," he said. "It must be done in machine speed."
It starts with agentic IAM, where each agent is registered with defined permitted actions and a human accountable for its behavior. "Here's my set of agents that I've built. Here are the agents. By the way, here's a human who's accountable for those agents," Dickman said. "So if something goes wrong, there's a person to talk to."
That identity layer feeds microsegmentation, a network-enforced boundary Dickman says enforces least-privileged access and limits blast radius. "Microsegmentation guarantees that least-privileged access," Dickman said. "You're not relying on a bunch of host agents, which can be bypassed or have other issues."
If the governance model works for a medical transcription agent handling patient records in an emergency department, it scales to less sensitive enterprise use cases.
Five priorities before agents reach production
1. Force cross-functional alignment now. Define what the organization expects from agentic AI across line-of-business, IT, and security leadership. Dickman sees the human coordination layer moving more slowly than the technology. That gap is the bottleneck.
2. Get IAM and PAM governance production-ready for agents. Dickman called out identity and access management and privileged access management specifically as not mature enough for agentic workloads today. Solidify the governance before scaling the agents. "That becomes the unlock of trust," he said. "Because when the technology platform is ready, you then need the right governance and policy on top of that."
3. Adopt a platform approach to networking infrastructure. A platform strategy enables data sharing across domains in ways fragmented point solutions cannot. That shared foundation is what makes the cross-domain correlation in the trust gap assessment above operationally real.
4. Design hybrid architectures from the start. Agentic AI handles reasoning and planning. Traditional deterministic tools execute the actions. Dickman sees this combination as the answer to token economics: it delivers the intelligence of foundation models with the efficiency and predictability of conventional software. Do not build pure-agent systems when hybrid systems cost less and fail more predictably.
5. Make the first use cases bulletproof on trust. Pick two or three high-value use cases and build them with role-based access control, privileged access management, and microsegmentation from day one. Even modest deployments delivered with best practices intact build the organizational confidence that accelerates everything after. "You can guarantee that trust to the organization, and that will unleash the speed," Dickman said.
That is the structural insight running through every section of this conversation. The 85% of enterprises stuck in pilot mode are not waiting for better models. They are waiting for the identity governance, the cross-domain visibility, and the policy enforcement infrastructure that makes production deployment defensible.
Whether they build on Cisco’s platform or assemble their own, Dickman’s framework holds: identity governance, cross-domain visibility, policy enforcement. None of those prerequisites is optional. The organizations that satisfy them first will deploy agents at a pace the rest cannot match, because every new agent inherits the trust architecture the first ones required. The ones still debating whether to start will watch that gap widen. Theoretical trust does not ship.
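For readers who want to picture the identity layer and the hybrid pattern together, here is a minimal Python sketch under stated assumptions. It is not Cisco's agentic IAM or any vendor API. `AgentRegistry`, `execute`, the agent ID, the owner address, and the tool names are all hypothetical. It only illustrates three ideas from the article: every agent is registered with an accountable human and an explicit allow-list, identities can be revoked, and a deterministic tool (not the agent) performs the action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentIdentity:
    agent_id: str
    owner: str                    # the accountable human (hypothetical field)
    permitted_actions: frozenset  # explicit allow-list, least privilege


class AgentRegistry:
    """In-memory stand-in for an agent identity store (illustrative only)."""

    def __init__(self):
        self._agents = {}

    def register(self, agent_id, owner, permitted_actions):
        if not owner:
            raise ValueError("every agent needs an accountable human owner")
        self._agents[agent_id] = AgentIdentity(
            agent_id, owner, frozenset(permitted_actions)
        )

    def revoke(self, agent_id):
        self._agents.pop(agent_id, None)

    def authorize(self, agent_id, action):
        """Return (allowed, owner). Unknown or revoked agents are denied."""
        identity = self._agents.get(agent_id)
        if identity is None:
            return False, None
        return action in identity.permitted_actions, identity.owner


def execute(registry, tools, agent_id, action, **kwargs):
    """The agent proposes an action; a deterministic tool carries it out."""
    allowed, owner = registry.authorize(agent_id, action)
    if not allowed or action not in tools:
        return {"status": "denied", "action": action, "escalate_to": owner}
    return {"status": "ok", "result": tools[action](**kwargs)}


registry = AgentRegistry()
registry.register(
    "transcriber-01", owner="dr.lee@example.org", permitted_actions={"append_note"}
)
tools = {"append_note": lambda patient_id, text: f"note added for {patient_id}"}

print(execute(registry, tools, "transcriber-01", "append_note",
              patient_id="p-17", text="follow-up in two weeks"))
print(execute(registry, tools, "transcriber-01", "prescribe", drug="example"))
```

In this sketch a denied request comes back with the owner's name attached, which mirrors the "there's a person to talk to" requirement. A real deployment would enforce the same boundary at the network layer rather than in application code.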