Measuring Business Quality: Benchmark-Driven Feedback Loops
The preceding feedback layers all address IT-level concerns: does the code compile, do the tests pass, are there errors in production, is the environment consistent. But IT-level correctness does not equal business-level correctness.
Consider a recommendation system: it passes all tests, CI is green across the board, deployment goes smoothly, and observability metrics look fine. Yet after launch it recommends the same product every time. From an IT perspective it is flawless: fast response times, no errors, correct database reads and writes. From a business perspective it is a complete failure. Tests will not catch this kind of problem, because tests verify that the code does what it is supposed to do, not that what the code does is what users actually need.
Benchmarks provide a feedback channel in the business dimension. They measure not the correctness of the code but the business value of its output: recommendation precision, search-result relevance, risk-model recall. These metrics have no simple pass or fail; they are continuous scores. Fix a bug and the benchmark rises by 3 points; change an algorithm and it drops by 5. This signal is clear enough, and fast enough, to drive continuous iteration.
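As a sketch of what a continuous score looks like in practice, the hypothetical function below computes precision@k for a recommendation list. All names here are illustrative, not taken from any particular system: the point is that the result is a number on a scale, not a binary verdict.

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommendations the user actually found relevant.

    A continuous score in [0, 1], not a pass/fail verdict: a bug fix or an
    algorithm change moves it up or down by some amount.
    """
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)


# Two of the top four recommendations were relevant: score 0.5.
print(precision_at_k(["a", "b", "c", "d"], {"b", "d"}, k=4))
```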
Benchmark-driven feedback loops matter because they pull the team's attention from "is the code correct" to "is the business outcome correct." Teams without benchmarks easily fall into a trap: all tests are green, all code reviews have passed, but users do not engage with the product after launch. Tests guarantee a floor (code does not break). Benchmarks pull towards a ceiling (how much business value does the code produce).
The challenge in building benchmarks is that business quality assessment often has no standard answer. Whether code is correct can be determined by running a test. Whether recommendations are good — who decides? One approach is to build evaluation sets from historical data: use past cases with known good outcomes as baselines and measure how a new version performs against them. Another approach is to feed A/B test results back into updating the benchmark. Either way, the key is making the benchmark a quantified signal that Agents can consume automatically, rather than a qualitative assessment requiring subjective human judgement.
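The historical-data approach can be sketched as follows. Everything here is hypothetical (the evaluation set, the `run_benchmark` helper, the stand-in model): the shape to notice is that the benchmark reduces a model version to one number that a machine can compare across versions.

```python
from statistics import mean

# Hypothetical evaluation set: past queries paired with outcomes known
# (from historical data) to have been good.
EVAL_SET = [
    {"query": "running shoes", "known_good": {"shoe-a", "shoe-b"}},
    {"query": "rain jacket", "known_good": {"jacket-x"}},
]


def run_benchmark(recommend, eval_set, k=3):
    """Score one model version: mean precision@k against the historical baseline."""
    scores = []
    for case in eval_set:
        top_k = recommend(case["query"])[:k]
        hits = sum(1 for item in top_k if item in case["known_good"])
        scores.append(hits / k)
    return mean(scores)


def naive_recommend(query):
    # Stand-in for the real model under evaluation.
    return ["shoe-a", "shoe-c", "jacket-x"]


print(run_benchmark(naive_recommend, EVAL_SET))
```

The output is a single score per version, so "did this change help?" becomes a numeric comparison rather than a meeting.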
For Agents, benchmarks provide an objective function they can optimise autonomously. An Agent can modify code, run the benchmark, observe the score change, and decide whether to keep the modification or revert it. This loop does not require human intervention, but it does require the benchmark itself to be trustworthy and stable. If benchmark noise is too high (scores vary substantially between runs), the Agent cannot distinguish genuine improvements from random fluctuation. If the benchmark is disconnected from real business metrics, the Agent may optimise in a direction opposite to what the business actually needs.
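One simple way to make that keep-or-revert decision robust to noise is to run the benchmark several times and require the gain to clearly exceed the run-to-run variation. This is a minimal sketch under that assumption; the threshold policy and function names are illustrative, not a prescribed algorithm.

```python
from statistics import mean, stdev


def evaluate(run_benchmark, runs=5):
    """Run the benchmark several times to separate signal from noise."""
    scores = [run_benchmark() for _ in range(runs)]
    return mean(scores), stdev(scores)


def keep_change(baseline_score, run_benchmark, min_gain=1.0, runs=5):
    """Keep a modification only if the mean improvement clearly beats the noise.

    Requires the gain to exceed both a minimum threshold and twice the
    observed run-to-run standard deviation (an arbitrary but simple policy).
    """
    new_mean, noise = evaluate(run_benchmark, runs)
    gain = new_mean - baseline_score
    return gain > max(min_gain, 2 * noise)


# A 5-point gain with no noise is kept; a 0.5-point gain is reverted.
print(keep_change(70.0, lambda: 75.0))
print(keep_change(70.0, lambda: 70.5))
```

If this decision rule returns `False`, the Agent reverts the modification and tries a different one; no human needs to adjudicate each attempt.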
Maintaining and evolving benchmarks is itself part of platform engineering. The business changes, user behaviour changes, and evaluation criteria need to change with them. A benchmark designed a year ago may no longer reflect current business priorities. Like specifications, benchmarks need continuous maintenance, or they degrade from useful signals into misleading noise.