Why AI models top the benchmark and fail the job

A benchmark scores one step at a time, but a real job is a chain of steps. A job succeeds only if every step is correct, so its odds fall away faster than the per-step score suggests.

Published June 18, 2026 · Updated June 28, 2026

01 / 04

An agent that clears 95% of steps finishes a clean 20-step job only 36% of the time. Each step succeeds nineteen times out of twenty, the kind of number that tops a benchmark. But one slip ruins the whole job, and a benchmark rarely shows that.
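The 36% figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every step succeeds or fails independently of the others; the function name `job_success_rate` is made up for illustration.

```python
def job_success_rate(step_accuracy: float, steps: int) -> float:
    """Chance that every step in a chain succeeds.

    Assumes steps are independent, so the per-step chances multiply.
    """
    return step_accuracy ** steps


# 95% per step, 20 steps in the job.
rate = job_success_rate(0.95, 20)
print(f"{rate:.1%}")  # 35.8%, which rounds to the 36% quoted above
```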

02 / 04

Hold that 95% accuracy per step, and by the 14th step the job's chance of success is down to a coin flip. The whole job only works if every step works, so, treating each step as an independent chance, you multiply that 95% by itself once for each step in the chain. Two steps is 95% of 95%, or about 90%. Thirteen steps is still just above 51%, and the fourteenth takes it just under 49%. The line falls much faster than the per-step number in the benchmark suggests.
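To see where the coin-flip point lands for other per-step scores, the multiplication can be run step by step until the chance drops below a threshold. A rough sketch, again assuming independent steps; `steps_until_below` is a hypothetical helper name.

```python
def steps_until_below(step_accuracy: float, threshold: float = 0.5) -> int:
    """Number of steps at which the job's success chance first drops below threshold.

    Assumes independent steps. Requires 0 <= step_accuracy < 1, otherwise
    the chance never falls and the loop would not end.
    """
    if not 0.0 <= step_accuracy < 1.0:
        raise ValueError("step_accuracy must be in [0, 1)")
    chance = 1.0
    steps = 0
    while chance >= threshold:
        chance *= step_accuracy  # one more step in the chain
        steps += 1
    return steps


print(steps_until_below(0.95))  # 14
```

Raising the per-step score pushes the coin-flip point out, but it does not remove it: any score below 100% crosses the threshold eventually.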

03 / 04

Now zoom out. Run 100 of these 20-step jobs instead of one, and on average 64 of them break somewhere before the end. Most of them fail quietly, with no error and no warning that anything went wrong. So most of the time the job comes back looking finished when it is not.
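The same count can be approximated by simulation rather than by formula. The sketch below is illustrative only: it assumes independent steps with a fixed 95% success chance, and `simulate_jobs` is an invented helper, not part of any library.

```python
import random


def simulate_jobs(jobs: int, steps: int, step_accuracy: float, seed: int = 0) -> int:
    """Count simulated jobs that break somewhere in the chain.

    Each step independently succeeds with probability step_accuracy
    (an assumption of this sketch, not a measured property of any model).
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same count
    broken = 0
    for _ in range(jobs):
        # A job breaks as soon as any single step fails.
        if any(rng.random() >= step_accuracy for _ in range(steps)):
            broken += 1
    return broken


runs = 10_000
broken = simulate_jobs(jobs=runs, steps=20, step_accuracy=0.95)
print(f"{broken / runs:.1%} of simulated jobs broke")
```

With enough simulated jobs, the printed share should settle close to the 64% figure. The simulation does nothing to flag a broken job, which is the point: without a check, the break leaves no trace.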

04 / 04

So, what is the solution? Insert a checkpoint every few steps that catches a mistake soon after it happens, before the errors get the chance to pile up. This is something you build into the system yourself, not something a stronger model gives you for free. The real number to engineer around is how many steps you allow between one check and the next.
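A rough way to see why the gap between checks matters is to model it. The sketch below rests on assumptions that go beyond the article: steps are independent, every checkpoint catches every mistake in its segment, and a failed segment is retried from the last checkpoint up to `max_attempts` times. The function and parameter names are hypothetical.

```python
def job_success_with_checkpoints(
    step_accuracy: float,
    total_steps: int,
    steps_between_checks: int,
    max_attempts: int = 3,
) -> float:
    """Chance a job finishes correctly when checked every few steps.

    Assumptions of this sketch: independent steps, a perfect checker,
    and up to max_attempts tries per segment before giving up.
    """
    success = 1.0
    remaining = total_steps
    while remaining > 0:
        length = min(steps_between_checks, remaining)  # last segment may be shorter
        one_try = step_accuracy ** length              # segment clean in a single attempt
        segment = 1.0 - (1.0 - one_try) ** max_attempts  # clean in at least one attempt
        success *= segment
        remaining -= length
    return success


# Compare different gaps between checks for a 20-step job at 95% per step.
for gap in (20, 10, 5, 1):
    chance = job_success_with_checkpoints(0.95, 20, gap)
    print(f"check every {gap:>2} steps: {chance:.1%}")
```

Under these assumptions, shortening the gap raises the job's success rate without changing the model's per-step score. A real checker will miss some mistakes and retries cost time, so treat the output as a direction, not a forecast.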
