Your Agent Is Running — But What Is It Actually Doing?
The agent processed 14,000 requests yesterday. Your dashboards show healthy CPU, normal memory, steady API latency. Everything looks green.
But did any of those requests loop on the same tool call for 47 iterations before timing out? Did the agent hallucinate a function parameter and silently recover? Did it access the production database when it should have only read from the replica? Your dashboards don't know. They were built for a different world — one where software is deterministic, single-request, and stateless.
Agents are none of those things.
Why your existing dashboards tell you nothing
Traditional monitoring assumes a simple contract: request comes in, response goes out, you measure duration, error rate, and throughput. That works for a web server. It fails for an agent that calls three tools, reasons about their outputs, decides it needs more data, calls two more tools, then produces a result that might be right or might be confidently wrong.
The unit of observability in an agent system isn't a request. It's a decision trace — the sequence of reasoning steps, tool invocations, state changes, and intermediate outputs that produced the final result. A CPU spike at 2:47 PM is just noise until you can connect it to the agent that was looping on a weather API because it couldn't parse the response format.
Consider a concrete example. An agent is supposed to reconcile invoices against purchase orders. It runs 200 times a day. The error rate on your dashboard is 3%. That seems acceptable. But the 3% represents 6 invoices where the agent marked a discrepancy that didn't exist, each one triggering a human review that takes 20 minutes. That is two hours of reviewer time every day. Nobody looks at the 97% success rate and asks about the false-positive rate, because the monitoring system doesn't distinguish between a failed API call and a bad reasoning step. Both are just "errors."
The dashboard is lying to you. Not out of malice. It just lacks the right categories.
What you actually need to see
Agent observability requires four distinct layers of information, and most tools cover only one or two.
Decision traces are the foundation. Why did the agent choose that action? What context was in its prompt when it made the call? Which previous step's output influenced this decision? A trace captures the full chain of reasoning — every model call, every tool invocation, every intermediate result. Without it, debugging an agent failure is guessing. "It failed on the third step" is vague. "It chose the 'escalate' tool because the confidence score on the invoice validation step was below 0.7" is actionable.
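As a rough illustration, a decision trace can be modeled as an ordered list of steps, each recording which earlier steps it depended on. The sketch below is a minimal in-memory version. `TraceStep`, `DecisionTrace`, and the invoice tool names are hypothetical, not the schema of any particular platform.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TraceStep:
    index: int
    kind: str                 # "model_call" or "tool_call"
    name: str                 # model or tool name
    inputs: dict[str, Any]    # prompt context or tool parameters
    output: Any
    rationale: str = ""       # why the agent took this step, if it said so
    depends_on: list[int] = field(default_factory=list)  # earlier step indexes


@dataclass
class DecisionTrace:
    run_id: str
    steps: list[TraceStep] = field(default_factory=list)

    def add(self, kind: str, name: str, inputs: dict[str, Any], output: Any,
            rationale: str = "", depends_on: tuple[int, ...] = ()) -> TraceStep:
        step = TraceStep(len(self.steps), kind, name, dict(inputs), output,
                         rationale, list(depends_on))
        self.steps.append(step)
        return step

    def explain(self, index: int) -> list[TraceStep]:
        """Return the step at `index` plus every earlier step it depended on."""
        seen: set[int] = set()
        stack = [index]
        while stack:
            i = stack.pop()
            if i not in seen:
                seen.add(i)
                stack.extend(self.steps[i].depends_on)
        return [self.steps[i] for i in sorted(seen)]


trace = DecisionTrace(run_id="run-001")
trace.add("tool_call", "fetch_invoice", {"invoice_id": "INV-1"}, {"total": 1840.5})
trace.add("tool_call", "validate_invoice", {"invoice_id": "INV-1"},
          {"confidence": 0.62}, depends_on=(0,))
trace.add("tool_call", "escalate", {"reason": "low confidence"}, "queued",
          rationale="validation confidence 0.62 is below 0.7", depends_on=(1,))

# Walk back from the escalation to the steps that led to it.
for step in trace.explain(2):
    print(step.index, step.name, step.rationale)
```

Recording `depends_on` explicitly is the design choice that matters here: it turns "it failed on the third step" into a chain you can walk backwards.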
Then there are tool call logs. For every tool invocation, you need: which tool was called, what parameters were passed, what the tool returned, how long the call took, and whether the agent used the result correctly. The last point is the one most observability tools miss. An agent that calls a database query tool and ignores the result in its next reasoning step has a different kind of problem than an agent whose tool call timed out. But both patterns look like "successful tool call" in a naive log.
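One possible shape for such a record is sketched below. The field names and the `query_purchase_order` tool are assumptions for illustration. `result_was_used` is a deliberately crude substring heuristic, since judging whether a result was used correctly usually needs an evaluator rather than string matching.

```python
import json
from typing import Any


def result_was_used(output: dict[str, Any], next_reasoning: str) -> bool:
    """Crude heuristic: did any scalar value from the tool output show up
    in the agent's next reasoning step?"""
    for value in output.values():
        if isinstance(value, bool) or not isinstance(value, (str, int, float)):
            continue
        text = str(value)
        if text and text in next_reasoning:
            return True
    return False


def tool_call_record(tool: str, params: dict[str, Any], output: dict[str, Any],
                     duration_ms: float, next_reasoning: str) -> dict[str, Any]:
    """Build one log record covering the five fields for a tool invocation."""
    return {
        "tool": tool,
        "params": params,
        "output": output,
        "duration_ms": duration_ms,
        "next_reasoning": next_reasoning,
        "result_used": result_was_used(output, next_reasoning),
    }


record = tool_call_record(
    tool="query_purchase_order",  # hypothetical tool name
    params={"po_id": "PO-77"},
    output={"total": 1840.5, "currency": "EUR"},
    duration_ms=212.4,
    next_reasoning="The invoice looks fine, so I will approve it.",
)
print(json.dumps(record))  # one JSON object per line is easy to query later
```

In this example the call succeeded but the reasoning never mentions the returned total or currency, so `result_used` comes out false, which is exactly the case a naive log would hide.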
State transitions are the third layer. What did the agent know at each step? Agent state isn't just the conversation history. It includes loaded context, retrieved documents, cached tool results, and the current execution branch. When an agent forks a sub-task and the sub-agent returns, does the parent agent correctly incorporate the result? State observability catches the cases where agents lose context between steps, which is one of the most common failure modes in production.
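A minimal sketch of state observability, assuming the agent's state can be snapshotted as a flat dictionary at each step: diff consecutive snapshots and look at what was dropped. The keys used here are hypothetical.

```python
from typing import Any


def diff_state(before: dict[str, Any], after: dict[str, Any]) -> dict[str, list[str]]:
    """Compare two state snapshots taken at consecutive steps."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "dropped": sorted(before.keys() - after.keys()),
        "changed": sorted(k for k in shared if before[k] != after[k]),
    }


# Hypothetical snapshots taken before a sub-agent fork and after it returns.
before_fork = {"history_len": 6, "retrieved_docs": ["po-77.pdf"], "branch": "main"}
after_join = {"history_len": 7, "branch": "main", "subtask_result": "matched"}

delta = diff_state(before_fork, after_join)
print(delta)
if delta["dropped"]:
    print("possible context loss:", delta["dropped"])
```

Here the parent picked up the sub-task result but lost its retrieved documents along the way, which shows up as a non-empty `dropped` list.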
Failure mode classification rounds it out. Not all failures are equal. The monitoring system needs to distinguish between loops (the agent repeats the same action without progress), stalls (the agent stops producing output), tool hallucinations (the agent invokes a tool that doesn't exist or passes bad parameters), context loss (the agent forgets earlier steps), and confidence failure (the agent produces a result with low certainty). Each of these needs a different response. A loop needs a kill switch. A tool hallucination needs schema validation. A confidence failure might need human review. A monitoring system that treats all failures as "exceptions" can't route any of them to the right response.
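A classifier along these lines might start as simply as the sketch below. The tool registry, thresholds, and input shape are all assumptions. Context loss is left out because detecting it needs state snapshots, not just the call list.

```python
from enum import Enum
from typing import Any


class FailureMode(Enum):
    LOOP = "loop"
    STALL = "stall"
    TOOL_HALLUCINATION = "tool_hallucination"
    CONFIDENCE_FAILURE = "confidence_failure"


# Hypothetical registry: tool name -> required parameter names.
TOOL_SCHEMAS = {"lookup_po": {"po_id"}, "escalate": {"reason"}}


def classify(calls: list[dict[str, Any]], confidence: float,
             seconds_since_output: float, *, max_repeats: int = 3,
             min_confidence: float = 0.7,
             stall_after: float = 60.0) -> set[FailureMode]:
    modes: set[FailureMode] = set()

    # Tool hallucination: unknown tool, or a required parameter is missing.
    for call in calls:
        required = TOOL_SCHEMAS.get(call["tool"])
        if required is None or not required.issubset(call["params"]):
            modes.add(FailureMode.TOOL_HALLUCINATION)

    # Loop: the identical call made max_repeats times in a row.
    run = 1
    for previous, current in zip(calls, calls[1:]):
        run = run + 1 if previous == current else 1
        if run >= max_repeats:
            modes.add(FailureMode.LOOP)

    if seconds_since_output > stall_after:
        modes.add(FailureMode.STALL)
    if confidence < min_confidence:
        modes.add(FailureMode.CONFIDENCE_FAILURE)
    return modes


calls = [{"tool": "lookup_po", "params": {"po_id": "PO-77"}}] * 3
calls.append({"tool": "send_fax", "params": {}})  # not in the registry
found = classify(calls, confidence=0.9, seconds_since_output=2.0)
print(sorted(mode.value for mode in found))
```

Returning a set rather than a single label is deliberate: one run can loop and hallucinate a tool, and each mode needs its own response.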
The tool landscape, honestly
The observability tools available today fall into three categories, and none of them fully solve the problem yet.
The first category is LLM-native platforms — LangSmith, Weights & Biases, and Braintrust. They started as prompt engineering and evaluation tools and added agent tracing as the market shifted. LangSmith's traces capture the full chain of LLM calls and tool invocations within LangGraph-based agents, and its playground lets you replay traces through different models. Weights & Biases added agent tracing in early 2026 and connects it to their existing experiment tracking infrastructure. Braintrust's evaluation workflows let you score agent outputs against rubrics and track regressions over time.
These tools are good at what they do, but they're optimized for the development cycle, not production. The traces are rich but expensive to store at scale. The alerting is basic. You can set thresholds on latency and error rate, but not on "the agent ignored a tool result" or "the confidence score dropped below 0.6 on a payment-related action."
The second category is open-source tracing, which offers a different set of tradeoffs. Arize Phoenix is the most mature option in this space. It integrates with OpenTelemetry, which means you can connect agent traces to your existing infrastructure monitoring. The open-source model means you control the data. That matters for regulated industries where traces can't go to a third party.
The tradeoff is maintenance. Phoenix requires infrastructure to run, schema management as agent architectures evolve, and ongoing work to keep up with framework changes. Teams using it in production typically dedicate one engineer to observability infrastructure.
The third category is custom tracing. This is what the teams doing the most advanced agent observability end up building. The pattern is consistent: instrument the agent framework with OpenTelemetry spans, emit structured logs with decision metadata, store traces in a time-series database, and build custom dashboards for the failure modes that matter to their specific deployment.
This is the most capable approach and the most expensive. The teams that go custom are typically running 500+ agents in production, have dedicated platform engineering teams, and have already hit the limits of off-the-shelf tools.
The debugging workflow that doesn't exist yet
Agent observability has one problem that makes everything else harder: you can't reproduce a failure.
In traditional software, a bug reproduces reliably given the same inputs. In agent systems, the same input can produce different outputs because of model temperature, non-deterministic sampling, and the order in which async tool calls resolve. You can't just re-run the agent with the same prompt and expect the same failure.
What practitioners do instead is trace comparison — take the failed trace, find a successful trace with similar inputs, and compare them step-by-step to identify where they diverged. LangSmith supports this with its comparison view. Arize Phoenix has a similar feature for LLM traces. But the comparison is manual and time-consuming. Anomaly detection that automatically flags divergent traces is the next frontier, and nobody has cracked it yet.
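The manual comparison boils down to finding the first step where two traces part ways. A minimal sketch, assuming each trace is a list of step dictionaries with a `tool` key (a hypothetical format): it compares tool names only by default, because "similar inputs" usually means the parameters differ anyway.

```python
from typing import Any, Optional


def first_divergence(failed: list[dict[str, Any]],
                     succeeded: list[dict[str, Any]],
                     keys: tuple[str, ...] = ("tool",)) -> Optional[int]:
    """Index of the first step where the traces differ on `keys`, else None."""
    for i, (a, b) in enumerate(zip(failed, succeeded)):
        if any(a.get(k) != b.get(k) for k in keys):
            return i
    if len(failed) != len(succeeded):
        return min(len(failed), len(succeeded))  # one trace simply ran longer
    return None


ok_trace = [
    {"tool": "fetch_invoice", "params": {"id": "INV-1"}},
    {"tool": "lookup_po", "params": {"po_id": "PO-77"}},
    {"tool": "approve", "params": {"id": "INV-1"}},
]
failed_trace = [
    {"tool": "fetch_invoice", "params": {"id": "INV-2"}},
    {"tool": "lookup_po", "params": {"po_id": "PO-78"}},
    {"tool": "escalate", "params": {"reason": "mismatch"}},
]

step = first_divergence(failed_trace, ok_trace)
print("traces diverge at step", step)
```

This only tells you where to start reading. Deciding why the agent escalated at that step still means inspecting the context it had at that point.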
The cost of not knowing
The engineering time spent debugging is the visible cost. The invisible one is the failures that go undetected.
An agent processing customer refunds runs for three weeks before anyone notices it has been approving refunds above its authorization limit. The tool call was successful. The transaction completed. The only signal was a pattern in the refund amounts that nobody was monitoring. That's a real incident from a mid-size e-commerce company in early 2026.
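The missing signal in that incident is a business-rule check over the tool call log, not an infrastructure metric. A sketch of what such a check could look like, with hypothetical field names and limits, and amounts held in cents:

```python
from typing import Any

# Hypothetical per-agent authorization limits, in cents.
REFUND_LIMITS = {"refund-agent": 50_00}


def over_limit(calls: list[dict[str, Any]],
               limits: dict[str, int]) -> list[dict[str, Any]]:
    """Return refund calls whose amount exceeds the calling agent's limit.
    Agents with no configured limit are treated as having a limit of zero."""
    return [c for c in calls if c["amount_cents"] > limits.get(c["agent_id"], 0)]


refund_calls = [
    {"agent_id": "refund-agent", "order": "A-1", "amount_cents": 19_99, "status": "ok"},
    {"agent_id": "refund-agent", "order": "A-2", "amount_cents": 240_00, "status": "ok"},
]

for call in over_limit(refund_calls, REFUND_LIMITS):
    print("refund above authorization limit:", call["order"], call["amount_cents"])
```

Both calls have status "ok", which is why an error-rate dashboard stays green. Only the second one trips the limit check.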
Or the compliance violation: an agent in a financial services deployment accessed a customer's transaction history without authorization because the identity chain wasn't logged. The tool call log showed "query_transactions" with a customer ID. It didn't show which agent invoked the tool, under whose delegation, or whether the access was within scope. The regulator noticed before the engineering team did.
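A log entry that would have answered the regulator's questions needs the identity chain alongside the call itself. The sketch below shows one possible record. Every field name and scope string is an assumption, not a standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToolCallAudit:
    tool: str
    resource: str                    # e.g. the customer ID being queried
    agent_id: str                    # which agent invoked the tool
    on_behalf_of: str                # who delegated authority to the agent
    granted_scopes: tuple[str, ...]  # what the delegation allows
    required_scope: str              # what this call needs

    @property
    def in_scope(self) -> bool:
        return self.required_scope in self.granted_scopes

    def to_log(self) -> dict:
        return {**asdict(self), "in_scope": self.in_scope}


entry = ToolCallAudit(
    tool="query_transactions",
    resource="customer:4821",
    agent_id="support-agent-3",
    on_behalf_of="user:jdoe",
    granted_scopes=("profile:read",),
    required_scope="transactions:read",
)
print(json.dumps(entry.to_log()))
```

With this record the out-of-scope access is visible at write time (`in_scope` is false) instead of being reconstructed after the fact.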
Or the cost surprise: an agent looped on a summarization task for 8 hours, generating 400,000 tokens of repeated output. The inference bill for that single loop was higher than the entire month's projected cost. The monitoring system showed "error rate: 0%." The API calls were all technically successful. There was just no progress.
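A guard for this case has to measure progress and spend, since the error rate stays at zero. The sketch below is one minimal approach under stated assumptions: it hashes each output, so it only catches verbatim repetition, and the class name and thresholds are hypothetical.

```python
import hashlib
from typing import Optional


class ProgressGuard:
    """Stops a run that keeps emitting the same output or overspends on tokens."""

    def __init__(self, max_tokens: int, max_repeats: int = 3) -> None:
        self.max_tokens = max_tokens
        self.max_repeats = max_repeats
        self.tokens = 0
        self._last_digest: Optional[str] = None
        self._repeats = 0

    def record(self, output: str, tokens: int) -> Optional[str]:
        """Return a stop reason, or None if the run may continue."""
        self.tokens += tokens
        digest = hashlib.sha256(output.encode("utf-8")).hexdigest()
        self._repeats = self._repeats + 1 if digest == self._last_digest else 1
        self._last_digest = digest
        if self._repeats >= self.max_repeats:
            return "no_progress"
        if self.tokens > self.max_tokens:
            return "token_budget_exceeded"
        return None


guard = ProgressGuard(max_tokens=400_000)
reason = None
for chunk in ["Summary v1", "Summary v2", "Summary v2", "Summary v2"]:
    reason = guard.record(chunk, tokens=500)
    if reason:
        print("stopping run:", reason)
        break
```

Near-duplicate output would slip past the hash comparison, so the token budget acts as the backstop.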
What's coming
The observability gap in agent systems is getting attention. I'm watching three areas.
OpenTelemetry semantic conventions for agents are in early draft review as of mid-2026. The community is working on standard span attributes for agent traces — tool call schemas, decision metadata, state transitions. This would let agent-specific observability tools plug into the infrastructure teams already run.
The convergence of tracing and evaluation is where the tool vendors are heading. LangSmith already connects traces to evaluation scores. Arize Phoenix has drift monitoring. The next step is automated root-cause analysis that surfaces the trace segment where the agent's behavior started to degrade. The tools that merge real-time monitoring with continuous evaluation will be the ones that production teams actually trust.
MCP observability extensions are coming through the protocol itself. MCP (covered in Article 007) is adding hooks for standardized trace context propagation across tool servers, structured logging for tool invocations, and the ability to attach evaluation metadata to individual tool calls. If MCP becomes the standard agent-tool interface, its observability features will be the baseline.
What this means for your team
If you have agents in production today and you're using an off-the-shelf monitoring tool, you're blind to the failure modes that actually matter. The dashboard is green. The agent is making mistakes.
The practical starting point: add structured logging to every tool invocation your agent makes. Include the parameter values, the raw output, and the agent's next reasoning step. Store these in a queryable format. Run a weekly audit of a random sample of traces. You'll find patterns your monitoring tool never surfaced. Then you'll know which observability platform to evaluate based on what you actually need to see.
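The weekly audit can be a few lines once the logs are queryable. A sketch, assuming the records are stored as JSON Lines (one JSON object per line) and that `run_id` and `tool` are fields you chose to log:

```python
import json
import random
from typing import Any, Iterable, Optional


def sample_traces(jsonl_lines: Iterable[str], k: int,
                  seed: Optional[int] = None) -> list[dict[str, Any]]:
    """Pick up to k records uniformly at random from JSON Lines input."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    rng = random.Random(seed)  # a fixed seed makes a given week's sample repeatable
    return rng.sample(records, min(k, len(records)))


# Stand-in for reading a week's log file.
lines = [json.dumps({"run_id": f"run-{i}", "tool": "lookup_po"}) for i in range(50)]

audit_batch = sample_traces(lines, k=5, seed=2026)
for rec in audit_batch:
    print(rec["run_id"])
```

Sampling uniformly, rather than reviewing only flagged runs, is what surfaces the failures your current alerts don't know to look for.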
Teams that take observability seriously now will have the production data to debug failures, the audit trails to satisfy regulators, and the cost visibility to budget accurately. The ones that don't will find their blind spots the expensive way: in the post-mortem.