Kimi Agent Achieves New SOTA on "Humanity's Last Exam"

Moonshot AI's Kimi-Researcher scores a state-of-the-art 26.9% on Humanity's Last Exam using end-to-end agent reinforcement learning, outperforming models such as o3.

Jun 23, 2025

Yesterday, Moonshot AI released the autonomous Agent Kimi-Researcher.

This Agent excels at multi-round search and reasoning, averaging 23 reasoning steps and visiting more than 200 websites per task.

It is built on an internal version of the Kimi k-series model and trained entirely through end-to-end agent reinforcement learning, making it one of the few Agents built on a proprietary model.

On "Humanity's Last Exam" (HLE), Kimi-Researcher achieved a Pass@1 score of 26.9%, a new state-of-the-art (SOTA) result, along with a Pass@4 accuracy of 40.17%. Its HLE score started at 8.6% and rose to 26.9% almost entirely through end-to-end reinforcement learning, a gain of 18.3 percentage points that demonstrates the potential of this approach for improving Agent intelligence.
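The article does not say how Pass@1 and Pass@4 were computed. One widely used convention is the unbiased pass@k estimator: sample n attempts per question, count the c correct ones, and estimate the chance that at least one of k randomly chosen attempts is correct. The sketch below assumes that convention; the function names and sample data are illustrative and do not come from Moonshot AI.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one question.

    n: attempts sampled for the question
    c: how many of those attempts were correct
    k: attempt budget being scored (k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that k attempts drawn without replacement are all wrong
    # is C(n - c, k) / C(n, k); pass@k is the complement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-question estimate over a whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data: three questions, four attempts each.
counts = [0, 1, 4]
print(benchmark_pass_at_k(counts, n=4, k=1))  # mean single-attempt accuracy
print(benchmark_pass_at_k(counts, n=4, k=4))  # share solved at least once in 4 tries
```

Under this convention Pass@4 can never be lower than Pass@1, which is consistent with the gap between 26.9% and 40.17% reported above.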

Kimi-Researcher also performed strongly on several complex, highly challenging real-world benchmarks.

On xbench (a new, dynamic, professionally aligned suite designed to connect AI capabilities with practical productivity), Kimi-Researcher achieved an average Pass@1 score of 69% on the xbench-DeepSearch subtask (averaged over 4 runs), surpassing models such as o3 equipped with search tools.

On benchmarks for multi-round search reasoning (e.g., FRAMES, Seal-0) and factual information retrieval (e.g., SimpleQA), Kimi-Researcher also delivered strong results.
