How Does Codex Stay on Track for Hours?
Why Codex fails on long tasks and how a dedicated Evaluator keeps it on track. Essential reading for AI agent builders.
One night, I handed a batch of tasks to Codex and went to sleep. In the morning I opened it up full of anticipation, expecting everything to be finished. The self-tests had all passed, but when I actually ran the code, it was riddled with errors, and even the architectural direction had drifted.
Anthropic engineer Prithvi Rajasekaran recently published a long article detailing the problems encountered when building agents for long-horizon tasks. What follows is my reading of part of it.
The root cause of long-horizon tasks going off track: in long-horizon agent applications, when the agent evaluates code it has just written, it almost always praises itself.
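To make that failure mode concrete, here is a minimal toy sketch of the difference between letting the author grade its own work and handing the grading to a separate evaluator. Nothing here comes from the original article or from Codex itself: `run_with_evaluator`, `Verdict`, `toy_generate`, `toy_self_check`, and `toy_evaluate` are hypothetical names, and the "agent" is a hard-coded stand-in rather than a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    passed: bool
    feedback: str


# Hypothetical signatures, assumed for this sketch only:
#   generate(task, feedback) -> code string
#   evaluate(task, code)     -> Verdict
Generator = Callable[[str, Optional[str]], str]
Evaluator = Callable[[str, str], Verdict]


def run_with_evaluator(
    task: str, generate: Generator, evaluate: Evaluator, max_rounds: int = 3
) -> Tuple[Optional[str], List[Verdict]]:
    """Loop: generate, have the evaluator judge, feed criticism back in."""
    feedback: Optional[str] = None
    history: List[Verdict] = []
    for _ in range(max_rounds):
        work = generate(task, feedback)
        verdict = evaluate(task, work)
        history.append(verdict)
        if verdict.passed:
            return work, history
        feedback = verdict.feedback
    return None, history  # never accepted within the round budget


def toy_generate(task: str, feedback: Optional[str]) -> str:
    # Stand-in for the coding agent: the first draft has a bug,
    # and the draft produced after any feedback is correct.
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def toy_self_check(task: str, work: str) -> Verdict:
    # Models the problem described above: the author approves its own work
    # without running it.
    return Verdict(True, "Looks good to me.")


def toy_evaluate(task: str, work: str) -> Verdict:
    # Independent evaluator: actually executes the code and checks behavior.
    namespace: dict = {}
    exec(work, namespace)
    result = namespace["add"](2, 3)
    if result == 5:
        return Verdict(True, "add(2, 3) returned 5")
    return Verdict(False, f"add(2, 3) returned {result}, expected 5")


if __name__ == "__main__":
    self_graded, _ = run_with_evaluator("add", toy_generate, toy_self_check)
    checked, history = run_with_evaluator("add", toy_generate, toy_evaluate)
    print("self-graded result:\n" + self_graded)
    print("evaluator rounds:", [v.feedback for v in history])
    print("evaluated result:\n" + checked)
```

With the self-check, the buggy first draft is accepted in one round. With the separate evaluator, the first draft is rejected with specific feedback and the corrected second draft is the one that gets through. A real evaluator would be far richer than a single assertion, but the design point is the same: the judge should not be the author.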



