McKinsey just dropped its State of AI 2025 report, and the headline stat is about what you’d expect: 88% of companies say they’re using AI somewhere, but only a third are scaling it beyond pilots. In other words, most aren’t seeing real impact...yet.
We hear this same story quite a bit at Tray, usually along the lines of “We got our agent live, now we’re trying to figure out if it’s actually doing anything meaningful.”
The uncomfortable truth is that a lot of teams can’t tell. It comes down to one question that many enterprise teams are asking…
How do we measure the effectiveness of our AI agents?
Why traditional metrics don’t always work for measuring AI agent effectiveness
With old-school automation, you had clear inputs, actions, and outputs. It’s kind of like wiring a light switch. If you flip it on and the light comes on, it’s working. It's easy to test and easy to measure.
AI agents don’t really work like that. They’re more like employees. They interpret, reason, and sometimes, for lack of a better phrase, they “wing it.” For instance, you might get five different answers to the same input. They may all be valid, but they’re all unique. And that makes testing and measurement a lot more difficult and messy.
Just counting token usage or conversation volume isn’t quite right. That’s like measuring a car’s performance by how often you turn the key. It tells you that something happened, but not whether it got you where you needed to go.
The two sides of measuring agent effectiveness
To truly measure whether your agent is doing its job, you need to look at it from two standpoints: technical performance and business value. To separate those layers, ask yourself two questions…
#1. Is the agent functioning well?
This is about execution. Is the agent doing what it’s supposed to do? Is it using tools correctly, retrieving the right info, and giving coherent answers?
These are the signals we’re seeing used by companies with agents in production:
- Resolution accuracy: Did the agent answer the question? Was the response useful? Think of it like being a teacher and having to review the agent’s homework. We see some teams using a second agent to evaluate the agent’s output (a minimal version of that is sketched after this list).
- Tool usage traces: If the agent is supposed to reset a password, did it actually access Okta? Track tool usage to confirm the agent did more than just talk about the task: that it actually accessed the tool and performed an action within it.
- Content alignment: How well the retrieved content matches the intent of the request. This is primarily for knowledge agents and the content they pull from. If a user asks about your PTO policy and the agent pulls a Slack message from 2021, that’s not very helpful, and something needs fixing.
- Feedback (and follow-up behavior): Having a thumbs up/down is a common way to gather feedback from users, but many people don’t bother to rate the responses they get. What the user does next is a better way to measure how well the agent responded to the prompt: did they ask again? Did they escalate the issue to a human? Did they give up? That’s your real signal.
- Cost per successful resolution (ACCT): Total run cost divided by correct outcomes. Compare to human handling and scripted automation.
- Agent value multiple (AVM): Business value divided by total agent cost. Use it to decide where to scale and where to pause.
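That “teacher reviewing homework” pattern is usually just a second model with a grading prompt. Here’s a minimal sketch, assuming a hypothetical `judge_llm` helper that wraps whatever model client you use and returns its text reply; the prompt wording and field names are illustrative, not a Tray API.

```python
import json

# Minimal "agent grades agent" resolution-accuracy check.
# `judge_llm` is a stand-in: it takes a prompt string and returns the judge model's reply.

JUDGE_PROMPT = """You are grading an AI support agent's answer.
Question: {question}
Agent answer: {answer}
Reply with JSON only: {{"resolved": true or false, "reason": "<one sentence>"}}"""

def grade_response(question: str, answer: str, judge_llm) -> dict:
    """Ask a second model whether the agent's answer actually resolved the question."""
    raw = judge_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Treat unparseable judge output as a failed grade so it gets human review.
        return {"resolved": False, "reason": "judge returned unparseable output"}

def resolution_accuracy(transcripts: list[dict], judge_llm) -> float:
    """Share of sampled conversations the judge marks as resolved."""
    grades = [grade_response(t["question"], t["answer"], judge_llm) for t in transcripts]
    return sum(g.get("resolved", False) for g in grades) / max(len(grades), 1)
```

The judge doesn’t have to be perfect; it just has to be consistent enough to show you which use cases are trending up or down.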
Pricing models can also swing ROI. Message-, conversation-, or outcome-based pricing changes your unit economics as interactions get longer or more complex. Model it before you scale.
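To make ACCT and AVM concrete, here’s a back-of-the-envelope sketch using the definitions above. The per-message pricing example and every number in it are illustrative assumptions, not benchmarks.

```python
def cost_per_successful_resolution(total_run_cost: float, successful_resolutions: int) -> float:
    """ACCT: total run cost divided by correct outcomes."""
    return total_run_cost / max(successful_resolutions, 1)

def agent_value_multiple(business_value: float, total_agent_cost: float) -> float:
    """AVM: business value divided by total agent cost (platform, model, and ops)."""
    return business_value / total_agent_cost

def per_message_run_cost(conversations: int, avg_messages: float, price_per_message: float) -> float:
    """Under per-message pricing, cost grows with conversation length, not just volume."""
    return conversations * avg_messages * price_per_message

# Illustrative month: 10,000 conversations averaging 6 messages at $0.03/message.
run_cost = per_message_run_cost(10_000, 6, 0.03)                                       # $1,800
acct = cost_per_successful_resolution(run_cost, 7_500)                                 # $0.24 per resolved conversation
avm = agent_value_multiple(business_value=45_000, total_agent_cost=run_cost + 1_200)   # 15x
print(round(acct, 2), round(avm, 1))
```

Rerun the same math with your human cost per ticket and your vendor’s actual pricing model, and the scale-or-pause decision usually gets a lot clearer.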
These signals help you debug and iterate, but they don’t necessarily tell you whether the agent is driving actual business results. Which brings us to the second question…
#2. Is the agent having an impact on the business?
Once the agent is functioning well, you’ll want to know whether that translates into business impact. For that, you can fall back on more familiar metrics:
- Ticket deflection
- Time saved per task
- Reduction in escalations
- Higher throughput or coverage
For example, Apollo.io is deflecting ~40% of inbound support tickets with an AI agent, cutting support load and improving time to resolution. That’s a hard, measurable outcome, and it translates directly into cost savings and improved support efficiency.
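If you want to put those outcomes in spreadsheet form, the math is simple. A back-of-the-envelope sketch with made-up inputs:

```python
def deflection_rate(agent_resolved: int, total_inbound: int) -> float:
    """Share of inbound tickets the agent closed without a human touching them."""
    return agent_resolved / max(total_inbound, 1)

def hours_saved(agent_resolved: int, avg_handle_minutes: float) -> float:
    """Rough human time saved, based on average handle time per ticket."""
    return agent_resolved * avg_handle_minutes / 60

# Illustrative: 4,000 of 10,000 tickets deflected at 12 minutes each.
print(f"{deflection_rate(4_000, 10_000):.0%} deflection, {hours_saved(4_000, 12):,.0f} hours saved")
```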
Get into a weekly operating rhythm
We see a lot of teams run a simple cadence to measure these things, so they’re always on top of performance and ready to take action:
- Daily: auto-grade a small sample of conversations per use case (a sketch of this step follows the list).
- Weekly: review accuracy, ACCT, and escalations by topic, then ship one change to prompts, tools, or knowledge.
- Monthly: compare against the human baseline and last month’s trend.
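Here’s a rough sketch of that daily step. It assumes you log conversations with a `use_case` tag and reuses the `grade_response` judge from the sketch earlier in this post; both are illustrative stand-ins, not a specific product API.

```python
import random
from collections import defaultdict

def daily_sample_grades(conversations: list[dict], judge_llm, per_use_case: int = 20) -> dict:
    """Auto-grade a small random sample per use case; returns resolution accuracy by use case."""
    by_use_case = defaultdict(list)
    for convo in conversations:
        by_use_case[convo["use_case"]].append(convo)

    report = {}
    for use_case, convos in by_use_case.items():
        sample = random.sample(convos, min(per_use_case, len(convos)))
        # grade_response is the judge-agent helper sketched earlier.
        grades = [grade_response(c["question"], c["answer"], judge_llm) for c in sample]
        report[use_case] = sum(g.get("resolved", False) for g in grades) / len(grades)
    return report

# Feed the weekly review: sort the worst-performing use cases to the top.
# worst_first = sorted(daily_sample_grades(yesterdays_convos, judge_llm).items(), key=lambda kv: kv[1])
```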
Why this is becoming so important
Box CEO Aaron Levie recently made the case that AI agents shift the vendor-customer dynamic because now the software is doing the work, not just providing the tools. That means agent effectiveness is now key to agent trust and adoption.
As he put it:
“Evals are essentially the new performance review system for autonomous work. And it will be even more important for agents than people given the blast radius of a good vs. bad agentic workflow.”
If agents are taking on meaningful work such as resetting passwords, answering customer tickets, and summarizing internal policies, then you need a clear, repeatable way to know they’re actually doing the job.
Close the loop or fly blind
Measuring agent effectiveness isn’t an exact science just yet. We’re still in the early days. Yes, tracking the right metrics is important. But so is creating a feedback loop.
Some of the most successful teams we work with are using agents to evaluate other agents: spotting weak responses, suggesting prompt tweaks, and flagging knowledge gaps. It’s become a key part of their development process.
In their report, McKinsey calls out workflow redesign as a top driver of AI value. They’re right, but redesign only works if you’re capturing what’s actually happening and using that data to make changes.
Bottom line
This is new territory for everyone. Every step with AI agents comes with its own set of challenges, from pilot to production and beyond. Measuring agent effectiveness, and proving it works consistently at scale, is one of those challenges we help teams navigate.
Looking ahead, AI agent success won’t be about speed but about building with visibility, iteration, and scale in mind.
And if it’s hard to measure one agent, imagine what it looks like when you’ve got fifty.
If you want a formal way to calculate both sides of agent effectiveness, Gartner outlines a practical framework with new metrics for agents, a workbook to simulate cost and value at scale, and guidance on where ROI can break.
Get the full framework and spreadsheet to model your agent costs and value.



