OpenAI Open-Sources PaperBench: Claude Leads, Reshaping AI Benchmarking

OpenAI's PaperBench reveals Claude-3.5-Sonnet dominates AI paper replication, surpassing GPT-4o & o1. New benchmark tests full agent capabilities beyond single tasks.

Apr 03, 2025

∙ Paid

"AI Disruption" Publication 5500 Subscriptions 20% Discount Offer Link.

OpenAI acknowledges that Claude is the best.

The newly open-sourced benchmark PaperBench pits six cutting-edge large model-driven agents against each other to reproduce top-tier AI conference papers, with the new Claude-3.5-Sonnet significantly surpassing o1/r1 to take first place.

Continue reading this post for free, courtesy of Meng Li.

Or purchase a paid subscription.