How Do You Debug a Production AI Agent When You Can't See What's Happening Inside?

The Black Box Problem: When Your AI Agent Fails in Production

You've built an AI agent using the Agent Development Kit (ADK). It works perfectly in your local testing environment. You deploy it to production, and for a few days, everything seems fine. Then, users start reporting unexpected responses, slow performance, or outright errors. You check the standard application logs, but they only show high-level request/response cycles. You have no visibility into the agent's internal decision-making process. Which tool did it call? What was the exact prompt sent to the LLM? Where did the latency spike occur in the chain of agent runs?

This is the black box problem. Without deep observability, debugging a production AI agent becomes a frustrating exercise in guesswork. You're left trying to reproduce issues locally, which is often impossible because the problem stems from specific user inputs, tool behaviors, or LLM non-determinism in the live environment. You need a way to trace the entire execution flow, log the critical GenAI interactions, and analyze patterns over time.

A good solution should provide:

Distributed tracing to visualize the step-by-step execution of an agent invocation, including LLM calls and tool executions.
Detailed logging of prompts and responses for auditing and debugging, with configurable privacy controls.
Structured analytics to aggregate agent events for performance monitoring and evaluation.
Integration flexibility to use your preferred observability platform, whether it's a Google Cloud service or a third-party tool.

The goal is to move from guessing to knowing. You need to see the exact path an agent took, the data it processed, and the decisions it made, all without adding significant overhead to your development workflow.

Introducing a Practical Observability Skill for ADK Agents

If you're building agents with the Google Agent Development Kit (ADK) and facing the black box problem, the google-agents-cli-observability skill is a structured guide worth inspecting. It's not a magic bullet, but a curated set of practices and configurations for adding observability to your ADK projects. It leverages the agents-cli tool to help scaffold and manage the necessary infrastructure and code patterns.

This skill is part of a broader suite for ADK development. It focuses specifically on the "monitor and debug" phase, distinct from skills for deployment or core code patterns. Think of it as a reference manual for setting up the "eyes and ears" of your deployed agent.

What This Skill Actually Covers: Observability Tiers

The skill breaks down observability into distinct tiers, allowing you to choose the right level of detail for your needs. You don't have to implement everything at once.

Tier 1: Distributed Tracing with Cloud Trace

This is the foundational layer. ADK uses OpenTelemetry to automatically generate traces for every agent invocation. You can see a hierarchical span tree: the top-level invocation, child agent_run spans for each agent in a chain, and further children for call_llm and execute_tool.

What it solves: Understanding execution flow, identifying latency bottlenecks, and pinpointing where errors occur.
Setup: For agents deployed to Agent Runtime, Cloud Run, or GKE using the scaffolded project, tracing is enabled automatically. For local development, it works with the agents-cli playground.
Where to look: Traces appear in the Google Cloud Console under Trace > Trace explorer.
Key consideration: This tier is on by default for scaffolded deployments. It is low-overhead and provides immediate value for debugging.
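To make the span hierarchy concrete, here is a minimal sketch of how you might reason about a trace once you have it. The Span class is a hypothetical stand-in, not the OpenTelemetry or Cloud Trace API, and the durations are made up. Only the span names (invocation, agent_run, call_llm, execute_tool) come from the description above. It also assumes child spans run one after another, so a span's own time is its duration minus its children's.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for one trace span (not the OpenTelemetry API)."""
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self) -> float:
        # Time spent in this span itself, excluding its direct children.
        # Assumes children run sequentially and do not overlap.
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span: Span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def slowest_by_self_time(root: Span) -> Span:
    """Return the span that accounts for the most time on its own."""
    return max(walk(root), key=lambda s: s.self_time_ms())


# Made-up durations, shaped like the hierarchy described above.
trace = Span("invocation", 2300.0, [
    Span("agent_run", 2250.0, [
        Span("call_llm", 600.0),
        Span("execute_tool", 1500.0),
        Span("call_llm", 100.0),
    ]),
])

bottleneck = slowest_by_self_time(trace)
print(bottleneck.name, bottleneck.self_time_ms())
```

In this made-up trace the execute_tool span dominates, which is the kind of finding the Trace explorer waterfall view lets you spot visually.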

Tier 2: Prompt-Response Logging

This tier captures the actual GenAI interactions: the model name, token counts, timing, and optionally, the content of prompts and responses. Data is exported to Cloud Storage (as JSONL files) and BigQuery.

What it solves: Auditing LLM interactions for compliance, debugging prompt engineering issues, and understanding token usage.
Setup: Requires infrastructure provisioning (a GCS bucket, BigQuery dataset, and service account permissions). The skill provides an agents-cli command that runs Terraform to set this up: agents-cli infra single-project --project PROJECT_ID.
Privacy control: A critical environment variable, OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT, controls content capture. The scaffolded project defaults to NO_CONTENT (metadata only) for privacy. You must explicitly configure it to capture full prompts/responses.
Key consideration: This is disabled by default for local development. It's primarily for deployed agents where you need an audit trail.
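As a rough illustration of what you can do with metadata-only logs, the sketch below sums token counts per model from JSONL lines and checks the content-capture variable. The record fields (model, input_tokens, output_tokens) are assumptions made for this example, since the exported schema is not shown here. Treating an unset variable as NO_CONTENT is also an assumption that mirrors the scaffolded default.

```python
import json
import os
from collections import defaultdict

# The variable name and the NO_CONTENT value come from the text above.
ENV_VAR = "OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"


def content_capture_enabled(env=os.environ) -> bool:
    """True only when the variable is set to something other than NO_CONTENT."""
    return env.get(ENV_VAR, "NO_CONTENT") != "NO_CONTENT"


def summarize_tokens(jsonl_text: str) -> dict:
    """Sum input and output tokens per model from JSONL log lines."""
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        bucket = totals[record["model"]]
        bucket["input_tokens"] += record.get("input_tokens", 0)
        bucket["output_tokens"] += record.get("output_tokens", 0)
    return dict(totals)


# Hypothetical records; real exports will have a different, richer schema.
SAMPLE = "\n".join([
    json.dumps({"model": "model-a", "input_tokens": 120, "output_tokens": 40}),
    json.dumps({"model": "model-a", "input_tokens": 80, "output_tokens": 10}),
    json.dumps({"model": "model-b", "input_tokens": 50, "output_tokens": 5}),
])

summary = summarize_tokens(SAMPLE)
print(summary)
print("content captured:", content_capture_enabled({}))
```

Token usage analysis like this works even with content capture off, which is one reason the metadata-only default is still useful.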

Tier 3: BigQuery Agent Analytics Plugin

This is an opt-in plugin that logs structured agent events (LLM calls, tool use, outcomes) directly to BigQuery in a format optimized for analysis.

What it solves: Building custom dashboards, performing conversational analytics, and running LLM-as-judge evaluations on historical data.
Setup: Enabled during project scaffolding with the --bq-analytics flag. It requires the same underlying infrastructure as prompt-response logging.
Key consideration: This is for teams that want to do deep, SQL-based analysis of agent behavior over time. It's more structured than raw log exports.
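The kind of SQL-based analysis this tier enables might look like the sketch below. It uses an in-memory SQLite database as a local stand-in for BigQuery so the example runs anywhere. The agent_events table and its columns are hypothetical and are not the plugin's actual schema.

```python
import sqlite3

# SQLite stands in for BigQuery here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agent_events ("
    "event_type TEXT, tool_name TEXT, latency_ms REAL, ok INTEGER)"
)
conn.executemany(
    "INSERT INTO agent_events VALUES (?, ?, ?, ?)",
    [
        ("tool_call", "search", 420.0, 1),
        ("tool_call", "search", 980.0, 0),
        ("tool_call", "lookup", 150.0, 1),
        ("llm_call", None, 610.0, 1),
    ],
)

# Per-tool call count, average latency, and failure count, slowest first.
QUERY = """
SELECT tool_name,
       COUNT(*)        AS calls,
       AVG(latency_ms) AS avg_latency_ms,
       SUM(1 - ok)     AS failures
FROM agent_events
WHERE event_type = 'tool_call'
GROUP BY tool_name
ORDER BY avg_latency_ms DESC
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The same aggregate-by-tool pattern should carry over to BigQuery once you substitute the real table and column names from your dataset.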

Tier 4: Third-Party Integrations

The ADK supports integration with several external observability platforms, including AgentOps, Phoenix, and MLflow. Each has different strengths.

What it solves: Using specialized visualization, team collaboration features, or prompt management tools from a vendor you already use.
Setup: Each platform has its own setup complexity, usually involving adding a specific exporter or SDK. The skill provides a comparison table to help you choose.
Key consideration: These integrations often replace or augment the native Google Cloud telemetry. You need to evaluate the trade-offs in terms of cost, data sovereignty, and feature set.

When to Use This Skill (and When Not To)

Consider inspecting this skill if:

You are building agents with the Google ADK and deploying them to Google Cloud (Agent Runtime, Cloud Run, GKE).
You have a deployed agent that is behaving unexpectedly, and you lack visibility into its internal operations.
You need to set up a structured logging and tracing pipeline for compliance or performance monitoring.
You are evaluating different observability platforms and want to understand the integration points with ADK.

This skill is probably not the right fit if:

You are not using the Google ADK. The configurations and CLI commands are specific to this framework.
You are looking for a general-purpose application performance monitoring (APM) guide. This is ADK-specific.
Your agent is still in the early prototyping phase and you only need basic print-statement debugging. The overhead of setting up full observability isn't justified yet.
You need guidance on deploying the agent itself. For that, you should look at a deployment-focused skill like google-agents-cli-deploy.

Critical Setup Context and Order of Operations

One of the most important details in this skill is the order of operations for deployments to Agent Runtime. The Terraform module that provisions observability infrastructure (service accounts, buckets, datasets) also manages the Reasoning Engine resource itself.

Correct sequence: Run agents-cli infra single-project before your first agents-cli deploy. This ensures Terraform owns the resource from the start.
Problem scenario: If you run agents-cli deploy first (creating the Reasoning Engine via the SDK), and then run the Terraform infra command, you create a state mismatch. Terraform cannot cleanly layer environment variables onto an SDK-deployed instance it doesn't manage.
Recovery options:
1. Switch to Terraform-managed: Delete the existing Reasoning Engine, then run infra followed by deploy. This loses any in-flight sessions.
2. Keep the SDK-deployed instance: Manually set the required environment variables and IAM permissions on the running instance. This takes more hands-on work, and you lose the benefits of Terraform state management.

This is a crucial operational detail. Getting the sequence wrong can lead to a frustrating cleanup process.
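One way to keep the sequence from slipping is to encode it in a small dry-run helper, sketched below. The two commands are the ones named above. The engine_exists flag is a hypothetical input the caller supplies, since this sketch does not detect an existing Reasoning Engine, and nothing is executed.

```python
def plan_first_rollout(project_id: str, engine_exists: bool) -> list[list[str]]:
    """Return the commands for a first rollout, infra before deploy.

    engine_exists is a hypothetical flag supplied by the caller; this sketch
    does not query Google Cloud to find out.
    """
    if engine_exists:
        raise RuntimeError(
            "A Reasoning Engine already exists outside Terraform; "
            "pick a recovery option before running the infra command."
        )
    return [
        ["agents-cli", "infra", "single-project", "--project", project_id],
        ["agents-cli", "deploy"],
    ]


for command in plan_first_rollout("PROJECT_ID", engine_exists=False):
    print(" ".join(command))  # dry run: print only, nothing is executed
```

Failing loudly when an SDK-deployed instance already exists forces a deliberate choice between the two recovery options instead of an accidental state mismatch.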

Safety Signals and Repository Health

When evaluating any open-source skill or tool, it's wise to check its provenance and health.

Repository: google/agents-cli on GitHub.
Owner: Google. This indicates the tool is officially supported as part of the ADK ecosystem, which reduces the risk of abandonment.
Stars & Forks: 3081 stars (as of the data snapshot) suggest significant community interest and adoption. Zero forks might indicate the repository is very new or that contributions are managed differently.
License: Apache-2.0. This is a permissive open-source license, allowing for broad use and modification.
Security Level: Marked as "Low" in the skill metadata. This likely refers to the risk profile of the CLI tool itself, which primarily scaffolds code and runs Terraform. You should still review the Terraform plans it generates for your specific cloud environment.
Topics: The repository is tagged with google-cloud, gemini, agents, adk, etc., confirming its specific domain.

Before you run any commands, especially infrastructure provisioning ones like agents-cli infra single-project, you should:

Review the generated Terraform files in your scaffolded project to understand exactly what cloud resources will be created.
Ensure you have the necessary IAM permissions in your Google Cloud project to create service accounts, buckets, and BigQuery datasets.
Run the command in a non-production project first to validate the setup.

Making the Decision: Does This Fit Your Workflow?

This skill provides a comprehensive, opinionated path to ADK observability. It's a good fit if you want a guided, infrastructure-as-code approach that integrates tightly with Google Cloud services. The tiered model lets you start simple (just tracing) and add more detail as needed.

However, it requires buy-in to the agents-cli workflow and Terraform for infrastructure management. If your team prefers to manually configure OpenTelemetry exporters or use a different infrastructure tool, you might use this skill as a reference but implement the concepts differently.

The key is to treat it as a practical option to inspect, not as a mandatory step. Read through the skill's detailed guide, understand the tiers, and check the setup prerequisites. If the problem of debugging a black-box agent resonates with you, and you're in the ADK ecosystem, this skill offers a clear, structured path forward.
