How Does Codex Stay on Track for Hours?

Why Codex fails on long tasks and how a dedicated Evaluator keeps it on track. Essential reading for AI agent builders.

Jun 29, 2026

∙ Paid

“AI Disruption” Publication 10100 Subscriptions 20% Discount Offer Link.

At night, I handed the tasks over to Codex and went to sleep. In the morning, I opened it full of anticipation, thinking it had completed all the tasks. The self-tests all passed, but when I actually ran it, everything was full of errors, and even the architectural direction had shifted.

Anthropic engineer Prithvi Rajasekaran recently published a long paper detailing the problems encountered when building long-horizon task Agents. Below is a partial interpretation of its content.

The root cause of long-horizon tasks going off track: When building long-horizon Agent applications, when the Agent evaluates the code it just wrote, it almost always praises itself.

Continue reading this post for free, courtesy of Meng Li.

Or purchase a paid subscription.