We Iterate on Models. We Can Iterate on Evals, Too

Building automated evals doesn’t need to be a huge investment. Start with a few quick-and-dirty examples and iterate!

(Cartoon: two coworkers coding; one struggles with evaluations, the other iterates quickly through model updates and test cases.)

Dear friends,

I’ve noticed that many GenAI application projects put in automated evaluations (evals) of the system’s output later than they should, and rely on humans to manually examine and judge outputs for longer than they should. This is because building evals is viewed as a massive investment (say, creating 100 or 1,000 examples, and designing and validating metrics), and there’s never a convenient moment to pay that up-front cost. Instead, I encourage teams to think of building evals as an iterative process. It’s okay to start with a quick-and-dirty implementation (say, 5 examples with unoptimized metrics) and then iterate and improve over time. This allows you to gradually shift the burden of evaluation away from humans and toward automated evals.

I wrote previously about the importance and difficulty of creating evals. Say you’re building a customer-service chatbot that responds to users in free text. There’s no single right answer, so many teams end up having humans pore over dozens of example outputs with every update to judge whether it improved the system. While techniques like LLM-as-judge are helpful, the details of getting this to work well (such as what prompt to use, what context to give the judge, and so on) are finicky to get right. All this contributes to the impression that building evals requires a large up-front investment, and thus on any given day, a team can make more progress by relying on human judges than by figuring out how to build automated evals.
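To make this concrete, here is a minimal sketch of the kind of LLM-as-judge check described above, assuming a simple pass/fail grading style. The prompt wording is just one possible starting point, and `call_llm` is a hypothetical helper standing in for whichever model API you use.

```python
# Minimal LLM-as-judge sketch. `call_llm` is a hypothetical helper that sends a
# prompt to whatever model API you use and returns the model's text response.
JUDGE_PROMPT = """You are grading a customer-service chatbot.

Conversation so far:
{context}

Chatbot response:
{response}

Does the response resolve the user's request accurately and politely?
Answer with a single word: PASS or FAIL."""

def judge_response(context: str, response: str, call_llm) -> bool:
    """Return True if the judge model labels the response PASS."""
    verdict = call_llm(JUDGE_PROMPT.format(context=context, response=response))
    return verdict.strip().upper().startswith("PASS")
```

Even a judge this simple usually needs iteration on the prompt and on what context it sees, and that tuning can happen gradually.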

I encourage you to approach building evals differently. It’s okay to build quick evals that are only partial, incomplete, and noisy measures of the system’s performance, and to iteratively improve them. They can be a complement to, rather than a replacement for, manual evaluations. Over time, you can gradually tune the evaluation methodology to close the gap between the evals’ output and human judgments. For example:

  • It’s okay to start with very few examples in the eval set, say 5, and gradually add to them over time — or subtract them if you find that some examples are too easy or too hard, and not useful for distinguishing between the performance of different versions of your system.
  • It’s okay to start with evals that measure only a subset of the dimensions of performance you care about, or that measure narrow cues you believe are correlated with, but don’t fully capture, system performance. For example, if at a certain moment in the conversation your customer-support agent is supposed to (i) call an API to issue a refund and (ii) generate an appropriate message to the user, you might start off measuring only whether it calls the API correctly and not worry about the message. Or if, at a certain moment, your chatbot should recommend a specific product, a basic eval could measure whether the chatbot mentions that product, without worrying about what it says about it (see the code sketch below).

So long as the output of the evals correlates with overall performance, it’s fine to measure only a subset of things you care about when starting.
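As one way of picturing this, here is a sketch of a quick-and-dirty eval along the lines of the list above: a handful of hand-written examples, each checked for a single narrow cue. The example inputs, the `issue_refund` API name, the product name, and the output format are all illustrative assumptions rather than a prescribed schema.

```python
# A quick-and-dirty eval: a few hand-written examples, each scored on a single
# narrow cue (did the agent call the refund API? did it mention the product?).
# The example data, API name, and field names below are illustrative only.
EXAMPLES = [
    {"input": "I was charged twice for my order. Please refund me.",
     "check": lambda out: "issue_refund" in out.get("api_calls", [])},
    {"input": "Which laptop do you recommend for frequent travel?",
     "check": lambda out: "UltraLight 13" in out.get("message", "")},
    # ... a few more examples in the same spirit
]

def run_eval(system, examples=EXAMPLES) -> float:
    """Score a system (a callable that maps an input string to a dict of
    outputs) by the fraction of narrow checks it passes."""
    passed = sum(1 for ex in examples if ex["check"](system(ex["input"])))
    return passed / len(examples)
```

A score like this is a noisy, partial measure, but it is enough to start comparing versions of the system, and you can add, remove, or sharpen examples as you learn which ones are informative.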

The development process thus comprises two iterative loops, which you might execute in parallel:

  • Iterating on the system to make it perform better, as measured by a combination of automated evals and human judgment;
  • Iterating on the evals to make them correspond more closely to human judgment.

As with many things in AI, we often don’t get it right the first time. So it’s better to build an initial end-to-end system quickly and then iterate to improve it. We’re used to taking this approach to building AI systems. We can build evals the same way.

To me, a successful eval meets the following criteria. Say we currently have system A, and we might tweak it to get a system B:

  • If A works significantly better than B according to a skilled human judge, the eval should give A a significantly higher score than B.
  • If A and B have similar performance, their eval scores should be similar.

Whenever a pair of systems A and B contradicts these criteria, that is a sign the eval is in “error,” and we should tweak it so that it ranks A and B correctly. This is a similar philosophy to error analysis in building machine learning algorithms, only instead of focusing on errors of the machine learning algorithm’s output (such as when it outputs an incorrect label), we focus on “errors” of the evals, such as when they incorrectly rank two systems A and B and thus aren’t helpful in choosing between them.
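As a rough illustration, the sketch below applies that error analysis to the eval itself: for pairs of systems where a human judge expressed a preference, it flags the pairs the eval ranks the other way. The (score_a, score_b, preference) tuple format and the tie margin are assumptions made for this example.

```python
# Error analysis on the eval itself: flag pairs of systems (A, B) where the
# eval's scores contradict a skilled human judge's preference. The tuple
# format and the tie margin are illustrative assumptions.
def eval_errors(pairs, tie_margin=0.05):
    """pairs: iterable of (score_a, score_b, preference), where preference is
    "A", "B", or "tie" according to a human judge. Returns the pairs where
    the eval's ranking contradicts that judgment."""
    errors = []
    for score_a, score_b, preference in pairs:
        if preference == "A" and score_a <= score_b:
            errors.append((score_a, score_b, preference))
        elif preference == "B" and score_b <= score_a:
            errors.append((score_a, score_b, preference))
        elif preference == "tie" and abs(score_a - score_b) > tie_margin:
            errors.append((score_a, score_b, preference))
    return errors
```

Each flagged pair is a prompt to tweak the eval, for example by adding examples or adjusting metrics, until it ranks the systems the way a careful human would.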

Relying purely on human judgment is a great way to get started on a project. But for many teams, building evals as a quick prototype and iterating to something more mature lets you put in evals earlier and accelerate your progress. 

Keep building!

Andrew 
