AI Disruption

How Does Codex Stay on Track for Hours?

Why Codex fails on long tasks and how a dedicated Evaluator keeps it on track. Essential reading for AI agent builders.

Meng Li
Jun 29, 2026


Before bed, I handed a batch of tasks to Codex and went to sleep. In the morning I opened it full of anticipation, expecting every task to be done. The self-tests all passed, but when I actually ran the code it was riddled with errors, and even the architectural direction had drifted.

Anthropic engineer Prithvi Rajasekaran recently published a long paper detailing the problems encountered while building Agents for long-horizon tasks. What follows is a partial interpretation of it.

The root cause of long-horizon tasks going off track: when an Agent evaluates the code it has just written, it almost always praises itself.
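To make the idea concrete, here is a minimal sketch of keeping the judge separate from the writer. It is not the implementation from the paper or from Codex. Every name in it (`Verdict`, `run_with_evaluator`, `toy_generate`, `toy_evaluate`) is hypothetical, and the toy functions stand in for real model calls. The one point it illustrates is that the verdict comes from an independent check that actually runs the code, not from the generator grading its own output.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    passed: bool
    feedback: str


def run_with_evaluator(
    task: str,
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str, str], Verdict],
    max_rounds: int = 3,
) -> Tuple[str, List[Verdict]]:
    """Alternate between a generator and a separate evaluator.

    The generator never judges its own draft; it only receives the
    evaluator's feedback from earlier rounds.
    """
    feedback: List[str] = []
    history: List[Verdict] = []
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        verdict = evaluate(task, draft)
        history.append(verdict)
        if verdict.passed:
            break
        feedback.append(verdict.feedback)
    return draft, history


# Hypothetical stand-ins for model calls (assumptions for illustration only).
def toy_generate(task: str, feedback: List[str]) -> str:
    # First attempt is buggy; it is corrected once feedback arrives.
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def toy_evaluate(task: str, draft: str) -> Verdict:
    # The evaluator executes the draft instead of trusting a self-report.
    scope: dict = {}
    exec(draft, scope)
    ok = scope["add"](2, 3) == 5
    return Verdict(ok, "" if ok else "add(2, 3) should return 5")


final_draft, verdicts = run_with_evaluator(
    "write add(a, b)", toy_generate, toy_evaluate
)
```

In this toy run the first draft fails the independent check and the second passes. A self-grading generator would likely have approved the first draft.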
