I’ve been thinking a lot about how we evaluate coding agents, especially after my last post on measuring coding assistants. The usual benchmarks focus on pass/fail results—did the agent solve the problem or not? But the more I tinker with these systems, the more it feels like that lens is too narrow. It ignores how the agent gets there.
Imagine two agents. Agent A spits out code that compiles and passes tests on the first try, but the code is messy, brittle, and hard to maintain. Agent B, on the other hand, starts off with a few errors, but quickly cleans them up across iterations, and the final product is well-structured and robust. Traditional benchmarks would rate A higher simply because it “passed.” But in reality, B might be the more reliable partner in the long run.
That’s why I’ve been exploring a different angle: testing agents based on iteration efficiency rather than just outcomes.
Measuring Iteration Efficiency
Here’s the rough idea (a small code sketch of these signals follows the list):
- First-run cleanliness – Track how many errors an agent produces on its first attempt. These could be lint issues, compile errors, or violations from a quality gate like SonarQube.
- Convergence speed – Measure how quickly those errors drop to zero across iterations. Does the agent flail around, or does it zero in methodically?
- Final quality – Evaluate the maintainability of the final code: readability, test coverage, and complexity.
- Time to result – Track the actual wall-clock time it takes the agent to reach its final solution. All else equal, an agent that achieves the same quality in less time is the better agent.
- Cost – Track the number of tokens used (and possibly their actual cost). This could be included as part of the score, or kept as a parallel score so that we can compare both cost-sensitive and cost-agnostic results. There’s a subtle trade-off here: if we expect token usage to always trend downward, the cost metric might become less interesting over time. But if cost is weighted as heavily as the other axes, rising token counts could unfairly penalize otherwise better solutions. Worth pondering whether cost should be an independent axis or baked into the combined score.
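To make those signals concrete, here’s a minimal sketch of how they might be captured as a normalized score vector. The IterationScore name, the field choices, and the budget-based normalization are all assumptions for illustration, not an existing format.

```python
from dataclasses import dataclass

@dataclass
class IterationScore:
    """Hypothetical per-run score vector; each field is normalized to 0-1, higher is better."""
    first_run_cleanliness: float  # 1.0 = no lint, compile, or quality-gate errors on the first attempt
    convergence_speed: float      # 1.0 = error counts reach zero within the first couple of iterations
    final_quality: float          # readability, test coverage, and complexity of the final code
    time_to_result: float         # 1.0 = well under the wall-clock budget for the task
    cost: float                   # 1.0 = well under the token budget for the task

def normalize_against_budget(used: float, budget: float) -> float:
    """One possible normalization: 1.0 at zero usage, 0.0 at or beyond the budget."""
    return max(0.0, 1.0 - used / budget)
```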
Each of these can be normalized to a 0–1 score. Then, instead of averaging, use a geometric mean. That way, a standout score on one axis can’t paper over a weakness on another: an agent that’s fast but produces sloppy code won’t get an inflated score, and one that writes clean code but never converges won’t, either.
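Here’s a small sketch of that aggregation, assuming the five-axis vector above (the numbers are made up):

```python
import math

def combined_score(scores: list[float]) -> float:
    """Geometric mean of normalized 0-1 scores: unlike an arithmetic mean,
    one very strong axis cannot offset a very weak one."""
    assert scores and all(0.0 <= s <= 1.0 for s in scores)
    # Floor each score slightly above zero so a single 0 keeps the result
    # comparable instead of collapsing it to exactly 0.
    floored = [max(s, 1e-6) for s in scores]
    return math.prod(floored) ** (1 / len(floored))

# Axis order: cleanliness, convergence, quality, time, cost.
fast_but_sloppy = [0.2, 0.4, 0.3, 0.95, 0.9]
balanced = [0.7, 0.7, 0.7, 0.7, 0.7]

print(round(combined_score(fast_but_sloppy), 2))  # 0.46
print(round(combined_score(balanced), 2))         # 0.7
```

A plain average would give the fast-but-sloppy vector 0.55, ahead of where the geometric mean puts it, which is exactly the kind of inflated score this is meant to avoid.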
At the same time, I don’t want to collapse everything into a single number. Keeping the full vector—cleanliness, convergence, quality, cost, time—lets us compare agents from different angles.
Making the Benchmark Real
For this to be meaningful, the benchmark itself has to be challenging. If the tasks are too simple, everyone gets a perfect score and the metric stops being useful. Real bug-fix and refactoring datasets like SWE-bench or program-repair suites provide a better foundation. They’re messy, nuanced, and much closer to the problems developers actually face.
One challenge: this inherently cannot be done as black-box testing. To measure iteration efficiency, we need visibility into the agent’s steps—tool calls, retries, and code revisions. That makes it hard to create a universal, externally verifiable leaderboard. Unless there’s some kind of private audit mechanism, we’ll have to accept that results may remain more internal than public.
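To make that concrete, here’s a sketch of the kind of per-iteration trace an agent would need to expose; every field name here is invented, not taken from any existing tool. Convergence speed then falls out of how quickly the error counts hit zero.

```python
# Hypothetical per-iteration trace reported by the agent under test.
trace = [
    {"iteration": 1, "tool_calls": 4, "lint_errors": 7, "test_failures": 3},
    {"iteration": 2, "tool_calls": 2, "lint_errors": 2, "test_failures": 1},
    {"iteration": 3, "tool_calls": 1, "lint_errors": 0, "test_failures": 0},
]

def iterations_to_clean(trace: list[dict]) -> int | None:
    """First iteration at which all tracked error counts reach zero, or None if never."""
    for step in trace:
        if step["lint_errors"] == 0 and step["test_failures"] == 0:
            return step["iteration"]
    return None

print(iterations_to_clean(trace))  # 3
```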
Another consideration is developer experience. If I make an improvement to my agent and just want to see whether it broke things or maybe made something slightly better, I don’t want to wait hours—or even many minutes—for a full test suite to run. There should probably be a “fast mode” version of this benchmark: lightweight checks that give decent preliminary results quickly, with deeper evaluations available in the background.
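Purely as an assumption about how that could look, a “fast” profile might just be the full benchmark with fewer tasks and shallower checks:

```python
# Hypothetical benchmark profiles: a quick smoke run for local iteration,
# and the full run for numbers worth publishing.
PROFILES = {
    "fast": {"tasks": 10, "max_iterations": 3, "checks": ["lint", "unit_tests"]},
    "full": {"tasks": 300, "max_iterations": 10,
             "checks": ["lint", "unit_tests", "quality_gate", "complexity"]},
}
```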
Toward a Practical Tool
One practical outcome of testing this approach would be a small library developers can plug into their coding agent. The library would serve up benchmark tasks, let the agent report events (iterations, tool calls, fixes), and return a final score vector. That would make it easier to experiment locally without reinventing the evaluation machinery each time.
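No such library exists yet, so the following is only a sketch of the shape its API might take; IterationBench, next_task, report_event, and score_vector are all placeholder names.

```python
from dataclasses import dataclass, field

@dataclass
class TaskRun:
    task_id: str
    events: list[dict] = field(default_factory=list)

class IterationBench:
    """Hypothetical harness: hands out tasks, collects iteration events, scores runs."""

    def __init__(self, profile: str = "fast"):
        self.profile = profile
        self.runs: list[TaskRun] = []

    def next_task(self) -> str:
        """Hand the agent the next benchmark task (reduced here to an ID)."""
        run = TaskRun(task_id=f"task-{len(self.runs) + 1:03d}")
        self.runs.append(run)
        return run.task_id

    def report_event(self, task_id: str, event: dict) -> None:
        """Record one iteration event: a tool call, a retry, a fix, error counts."""
        run = next(r for r in self.runs if r.task_id == task_id)
        run.events.append(event)

    def score_vector(self, task_id: str) -> dict[str, float]:
        """Return per-axis 0-1 scores; real scoring would go here, this stub
        only derives a toy convergence number from the event count."""
        run = next(r for r in self.runs if r.task_id == task_id)
        return {"convergence_speed": 1.0 / max(len(run.events), 1)}
```

The loop on the agent side would then be: pull a task with next_task, call report_event after every iteration, and read back the score vector once the run finishes.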
Why This Matters
This kind of benchmark helps us see whether an “improvement” is actually an improvement. Did the new agent version converge faster? Did it produce fewer initial errors? Did the final code hold up better under scrutiny? And did it do so at a reasonable cost? By tracking the whole journey instead of just the destination, we can compare agents more fairly—on speed, accuracy, maintainability, and efficiency.
There’s also a deeper benefit here: by running tests that explicitly require linting and quality gates, we shine a light on the weaknesses of today’s agents. It forces code quality to become a first-class concern rather than an afterthought. In other words, these benchmarks don’t just measure agents—they push them to improve in the areas that matter most for long-term maintainability.
I’m curious how this idea lands with you. Does it spark thoughts about how you’d want to measure your own coding agent—or about what trade-offs really matter to you: speed, cleanliness, maintainability, cost? I’d love to hear your perspective in the comments below, and maybe we can sharpen these ideas together.