How Do You Know If Your AI Agent Actually Works? A Guide to Systematic Evaluation

The Problem: You've Built an Agent, But Does It Actually Work?

You've spent days, maybe weeks, building an AI agent. It can answer questions, call tools, and hold conversations. You've tested it manually a few times, and it seems to work. But as you prepare to deploy it or share it with users, a nagging question emerges: How do you actually know it's reliable?

Manual testing is a start, but it's fragile. You might test the same three scenarios over and over, missing edge cases. You might change a prompt and break something you didn't think to check. You might not even have a clear definition of what "working" means for your agent. Is it enough that it gives a plausible-sounding answer, or does it need to be factually correct every time? Does it need to use tools in a specific order? Should it refuse certain requests?

Without a structured evaluation process, you're flying blind. You can't measure improvement, you can't catch regressions, and you can't confidently tell stakeholders that your agent meets a quality bar. This is especially true for agents built on frameworks like Google's Agent Development Kit (ADK), where the behavior can be complex and multi-turn.

The core issue is that evaluation is usually treated as an afterthought. Most developers build first and plan to test later, but "later" often means a few manual spot-checks before shipping. What you need is a way to define what success looks like upfront, run your agent against those definitions automatically, and get clear, actionable feedback on where it fails.

What a Good Evaluation Solution Should Change

A proper evaluation system for an AI agent should do more than just run a few test cases. It should provide:

A structured way to define test cases. Not just "ask it a question," but a dataset with inputs, expected outputs, and the context needed to judge success.
Automated execution. You should be able to run your agent against the entire dataset with one command, not manually feeding it prompts.
Objective scoring. The results shouldn't depend on your subjective feeling. There should be metrics, either deterministic checks or LLM-as-judge evaluations, that apply the same criteria to every case so that scores can be compared from run to run.
Actionable failure analysis. When something fails, you need to know why. Was it a hallucination? Did it pick the wrong tool? Did it give up too early? The feedback should point you toward a fix.
A way to track progress. After you make a change, you need to see if your scores improved without breaking other things.

This isn't about finding a magic "pass/fail" button. It's about creating a feedback loop that lets you systematically improve your agent's quality. You iterate: define a test, run it, see where it fails, fix the agent, and run it again. Over time, you build confidence that your agent behaves as intended across a range of scenarios.

Introducing google-agents-cli-eval: A Practical Option for ADK Agent Evaluation

If you're building agents with Google's Agent Development Kit (ADK) and using the agents-cli toolchain, the google-agents-cli-eval skill provides a structured framework for this evaluation loop. It's not a standalone application but a set of commands and methodologies integrated into the agents-cli tool.

The skill is designed around what Google calls the "Quality Flywheel"—an iterative process of preparing data, running inference, grading results, analyzing failures, and optimizing the agent. It's specifically tailored for ADK agents and leverages the Agent Platform's evaluation service for some advanced features.

Important context: This skill assumes you are already using agents-cli and have an ADK-based agent project. It's not a general-purpose evaluation tool for any AI model. It's deeply integrated with the ADK ecosystem.

How the Evaluation Workflow Works

The process follows a clear, five-stage loop. You don't always need to use every stage, but understanding the flow is key.

Stage 1: Prepare Your Evaluation Data

You need test cases. The skill suggests starting with a simple JSON dataset file, often scaffolded in tests/eval/datasets/. A basic dataset might look like this:

[
  {
    "input": "What's the weather in London?",
    "expected_output": "The current weather in London is...",
    "tools_used": ["weather_api"]
  }
]
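
Before running anything, it can help to sanity-check the dataset file. The following is a minimal Python sketch, assuming each case carries the three fields shown in the example above (input, expected_output, tools_used). The function name validate_cases is hypothetical and not part of agents-cli, and the real dataset schema may differ, so treat the field list as an assumption and check the skill's reference documentation.

```python
import json

# Field names are taken from the example dataset above; the real
# agents-cli dataset schema may contain more (or different) fields.
REQUIRED_FIELDS = {"input": str, "expected_output": str, "tools_used": list}


def validate_cases(cases):
    """Return a list of human-readable problems found in the eval cases."""
    if not isinstance(cases, list):
        return ["dataset must be a JSON array of cases"]
    problems = []
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            problems.append(f"case {index}: expected an object")
            continue
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in case:
                problems.append(f"case {index}: missing '{field}'")
            elif not isinstance(case[field], expected_type):
                problems.append(
                    f"case {index}: '{field}' should be {expected_type.__name__}"
                )
    return problems


if __name__ == "__main__":
    # Second case is deliberately incomplete to show what gets reported.
    sample = json.loads("""
    [
      {
        "input": "What's the weather in London?",
        "expected_output": "The current weather in London is...",
        "tools_used": ["weather_api"]
      },
      {"input": "Book a flight"}
    ]
    """)
    for problem in validate_cases(sample):
        print(problem)
```

A check like this catches malformed cases before any model calls are spent on them.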

For more complex, multi-turn conversations, you can use the agents-cli eval dataset synthesize command. This runs a simulated user against your live agent to generate realistic conversation traces, which can then be used as evaluation data. This is useful when you don't have real user logs yet.

Stage 2: Run Your Agent Against the Data

If you created the dataset manually, you need to run your agent on it. The command agents-cli eval generate executes your agent for each input in the dataset and saves the full conversation traces (including tool calls and intermediate steps) to an artifacts/traces/ directory.

If you used the synthesize command in Stage 1, you can skip this step because it already produced the traces.

Stage 3: Grade the Results (The Core Step)

This is where the evaluation happens. The agents-cli eval grade command takes the traces from Stage 2 and scores them against your defined metrics. It writes detailed results in both JSON and HTML, with scores and, importantly, the judge model's rationale for each score.

You can combine Stages 2 and 3 with the shortcut agents-cli eval run for the common case.
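
The choice between the separate commands and the shortcut in Stages 2 and 3 can be sketched as a small Python helper. This is an illustration only: plan_eval_commands and run_commands are hypothetical names, and the commands are shown without arguments because this article does not list any. Real invocations may need options such as a dataset or config path.

```python
import subprocess


def plan_eval_commands(traces_exist=False, use_shortcut=True):
    """Return the agents-cli commands to run, as argument lists.

    traces_exist: True if traces were already produced (for example by the
                  synthesize command in Stage 1), so inference can be skipped.
    use_shortcut: True to use the combined `eval run` command.
    """
    if traces_exist:
        return [["agents-cli", "eval", "grade"]]
    if use_shortcut:
        return [["agents-cli", "eval", "run"]]
    return [
        ["agents-cli", "eval", "generate"],
        ["agents-cli", "eval", "grade"],
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the plan instead of executing it, so the sketch is safe to run
    # on a machine without agents-cli installed.
    for command in plan_eval_commands(traces_exist=False, use_shortcut=False):
        print(" ".join(command))
```

Keeping the plan separate from its execution makes it easy to log or review the commands before anything runs.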

Stage 4: Analyze Failures

The HTML report from Stage 3 is your primary tool for understanding what went wrong. You can open it in a browser to see per-case scores and the judge's reasoning.

For larger datasets with many failures, the agents-cli eval analyze command can help. It uses an LLM to cluster failures and identify root causes (e.g., "30% of failures are due to premature conversation termination"). This is more efficient than reading every single failure report when you have dozens of cases.
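
To make the idea of a root-cause breakdown concrete, here is a small Python sketch that tallies failures that already carry a label. It is not how agents-cli eval analyze works (that command uses an LLM to do the clustering). The record shape and the "reason" key are assumptions made for illustration.

```python
from collections import Counter


def failure_shares(failures):
    """Map each failure label to its share of all failures, largest first."""
    counts = Counter(item["reason"] for item in failures)
    total = sum(counts.values())
    # most_common() orders labels by count, so the biggest problem comes first.
    return {reason: count / total for reason, count in counts.most_common()}


if __name__ == "__main__":
    # Hypothetical, hand-labelled failures; real labels would come from
    # reading the judge rationales in the report.
    failures = (
        [{"reason": "wrong_tool"}] * 5
        + [{"reason": "premature_termination"}] * 3
        + [{"reason": "hallucination"}] * 2
    )
    for reason, share in failure_shares(failures).items():
        print(f"{reason}: {share:.0%}")
```

Sorting by share points you at the fix that is likely to move the scores most.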

Stage 5: Optimize and Fix

Based on the failure analysis, you edit your agent. This could mean:

Adjusting the system prompt or instructions.
Changing tool descriptions to guide the agent better.
Modifying the agent's orchestration logic.
Updating the evaluation dataset if the expected behavior was wrong.

The skill provides a mapping of common failures to fixes. For example, a low multi_turn_task_success score means the agent isn't completing the user's goal; look for issues with tool selection, missing steps, or giving up too soon.

There's also an agents-cli eval optimize command that uses ADK's GEPA prompt optimization. Use this with caution. It's expensive (makes many LLM calls) and time-consuming. It's best used as a final step after you've manually fixed the obvious issues, and only if the remaining failures are clearly prompt-related.

Choosing the Right Metrics for Your Agent

The skill comes with a set of built-in metrics. Choosing the right ones depends on what you care about most.

multi_turn_task_success: The most important metric for most agents. Did the agent ultimately achieve what the user asked? This is a catch-all for goal completion.
multi_turn_trajectory_quality: Was the agent's reasoning path logical? Did it take efficient steps, or did it wander?
multi_turn_tool_use_quality: Specifically evaluates how well the agent used tools across the conversation. Did it call the right tools with the right arguments?
final_response_quality: Judges the quality of the agent's last response, without needing a ground-truth reference answer. Useful for open-ended tasks.
hallucination: Checks if the agent made factual claims not supported by its context or tools. Critical for RAG agents.
safety: Checks for policy violations.

You can run agents-cli eval metric list to see all available metrics. If none fit, you can write a custom metric using an LLM judge or deterministic Python code.
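
As an illustration of what a deterministic check might compute, the Python sketch below scores tool use from two plain lists of tool names. The function names and signatures are hypothetical. This is not the agents-cli custom metric interface, which is described in the skill's reference documentation.

```python
def tool_recall(expected_tools, called_tools):
    """Fraction of expected tools that the agent actually called (0.0 to 1.0)."""
    expected = set(expected_tools)
    if not expected:
        return 1.0  # nothing was required, so nothing can be missing
    return len(expected & set(called_tools)) / len(expected)


def called_in_order(expected_tools, called_tools):
    """True if expected_tools appear in called_tools in the same relative order.

    Extra calls in between are allowed; only the relative order matters.
    """
    remaining = iter(called_tools)
    # `tool in remaining` consumes the iterator up to the first match, so each
    # expected tool must be found after the previous one.
    return all(tool in remaining for tool in expected_tools)


if __name__ == "__main__":
    expected = ["search_flights", "book_flight"]  # hypothetical tool names
    called = ["search_flights", "get_weather", "book_flight"]
    print(tool_recall(expected, called))
    print(called_in_order(expected, called))
```

Deterministic checks like these are cheap and repeatable, which makes them a useful complement to LLM-judged metrics when the expected tool sequence is known in advance.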

When to Use This Skill (and When Not To)

Good fit:

You are building an agent with the Google ADK.
You are using the agents-cli toolchain for development.
You want a structured, repeatable way to test agent quality.
Your agent involves multi-turn conversations and tool use.
You are comfortable with command-line tools and JSON configuration.

Not a good fit:

You are not using the ADK or agents-cli. This skill is tightly coupled to that ecosystem.
You need a simple, one-off test of a single prompt. Manual testing or a simpler script might be faster.
You are looking for a GUI-based evaluation platform. This is a CLI-driven workflow.
Your agent is purely generative (no tools, no multi-turn logic). Simpler metrics might suffice.

What to Inspect Before You Start

Before diving in, check these things:

Your agents-cli installation. The skill requires agents-cli. If it isn't installed yet, install it with uv tool install google-agents-cli (you'll need uv installed first).
Your project structure. If you used the /google-agents-cli-scaffold skill to create your project, you likely already have the tests/eval/ directory and eval_config.yaml set up. If not, you'll need to create them.
Your agent's stability. The evaluation process assumes your agent can run without crashing. If your agent is in a very early, unstable state, fix the critical bugs first.
Your definition of "good." Before writing eval cases, think about what success means for your agent. What should it do? What should it not do? This will guide your metric selection and dataset creation.

Safety and Repository Signals

Repository: The skill is part of the official google/agents-cli repository on GitHub, maintained by Google. This suggests a level of maintenance and alignment with the ADK roadmap.
License: Apache-2.0, which is permissive and standard for open-source projects.
Security Level: Marked as "Low" in the skill metadata. This likely refers to the skill's own risk profile, not your agent's. The evaluation commands themselves don't execute arbitrary code on your system beyond running your agent, but running your agent means any tools it calls will execute as usual, so always review what commands you run.
Topics: The repository topics (google-cloud, gemini, agents, adk) confirm its focus on the Google AI ecosystem.

Getting Started: A Practical First Steps Checklist

Install the toolchain: Ensure uv is installed, then run uv tool install google-agents-cli.
Scaffold if needed: If starting a new project, consider using the /google-agents-cli-scaffold skill to get the standard directory structure.
Create a minimal dataset: Start with 1-2 simple test cases in tests/eval/datasets/basic-dataset.json. Focus on a core use case.
Run the full evaluation: Execute agents-cli eval run. This will run your agent on the dataset and grade it.
Inspect the HTML report: Open the generated results_<timestamp>.html file. Look at the scores and the judge's rationale for any failures.
Fix one thing: Based on the failure, make one targeted change to your agent (e.g., clarify a prompt instruction).
Re-run and compare: Run agents-cli eval run again. Use agents-cli eval compare <old_results>.json <new_results>.json to see if your change helped.
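
The comparison in the last step boils down to checking, metric by metric, whether a change helped or hurt. The Python sketch below shows that idea on plain dictionaries. It assumes each run can be summarized as a mapping from metric name to a score where higher is better. The real results file schema is not shown in this article, so compare_scores and its inputs are hypothetical.

```python
def compare_scores(old, new, tolerance=0.01):
    """Label each metric as improved, regressed, unchanged, added, or removed.

    old, new:  dicts mapping metric name -> score (assumed higher is better).
    tolerance: changes smaller than this are treated as noise.
    """
    report = {}
    for metric in sorted(set(old) | set(new)):
        if metric not in old:
            report[metric] = "added"
        elif metric not in new:
            report[metric] = "removed"
        else:
            delta = new[metric] - old[metric]
            if delta > tolerance:
                report[metric] = "improved"
            elif delta < -tolerance:
                report[metric] = "regressed"
            else:
                report[metric] = "unchanged"
    return report


if __name__ == "__main__":
    # Hypothetical summary scores from two runs.
    before = {"multi_turn_task_success": 0.5, "final_response_quality": 0.75}
    after = {"multi_turn_task_success": 0.75, "final_response_quality": 0.5}
    for metric, status in compare_scores(before, after).items():
        print(f"{metric}: {status}")
```

Looking at every metric, not just the one you were targeting, is what catches a fix that improves task success while quietly degrading something else.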

This iterative process—define, run, grade, fix, compare—is the core of the skill. It turns agent quality from a guessing game into a measurable, improvable property. Start small, focus on your most important use case first, and expand your test coverage as your agent matures.

For the full reference documentation, including dataset schemas and metric details, visit the google-agents-cli-eval skill page.
