Gemini 3 Launches Agentic Vision in Response to DeepSeek-OCR2

Google DeepMind's Agentic Vision in Gemini 3 Flash revolutionizes image understanding with a Think-Act-Observe loop and Python code execution for zooming, annotating, and visual analysis.

Jan 28, 2026



You didn’t expect this, did you? Google DeepMind has just rolled out a heavyweight new capability for Gemini 3 Flash: Agentic Vision.

Could it be that they were provoked by DeepSeek-OCR2?

The pitch is a shift in how large models understand images: from the old method of “guessing” to today’s “in-depth investigation.”

The capability comes from the Google DeepMind team. Product manager Rohan Doshi explained that traditional AI models usually take only a single static glance when processing an image.

If the details in the image are too small—like a serial number on a microchip or a blurry road sign in the distance—the model often has no choice but to “guess.”

But Agentic Vision introduces a “Think-Act-Observe” closed loop:

The model is no longer passively receiving pixels; instead, it actively writes Python code to manipulate the image based on the user’s needs.
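To make the loop concrete, here is a toy sketch of what one Think-Act-Observe cycle could look like. It is not Gemini's implementation: the image is a plain grid of numbers, and `crop`, `upscale`, `think`, and `agentic_look` are hypothetical names invented for illustration, with a trivial size check standing in for the model's reasoning.

```python
# Toy Think-Act-Observe loop. Every name here is a hypothetical illustration,
# not part of the Gemini API. The "image" is a list of rows of pixel values.

def crop(image, top, left, height, width):
    """Act: cut out the region of interest."""
    return [row[left:left + width] for row in image[top:top + height]]

def upscale(image, factor):
    """Act: nearest-neighbour zoom, repeating each pixel `factor` times per axis."""
    out = []
    for row in image:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def think(view, goal_size):
    """Think: stand-in for the model's planning step.

    Returns an action tuple, or None once the view is big enough to "read".
    """
    if len(view) >= goal_size:
        return None
    return ("upscale", 2)

def agentic_look(image, region, goal_size=8, max_steps=5):
    """Crop to a region, then keep zooming until `think` is satisfied."""
    view = crop(image, *region)
    steps = 0
    while steps < max_steps:
        action = think(view, goal_size)   # Think: decide what to do next
        if action is None:
            break
        _, factor = action
        view = upscale(view, factor)      # Act: run the image operation
        steps += 1                        # Observe: the new view feeds the next Think
    return view, steps

# A 10x10 test image whose pixel value encodes its position (row * 10 + col).
image = [[r * 10 + c for c in range(10)] for r in range(10)]
view, steps = agentic_look(image, region=(4, 4, 2, 2))
```

In the real feature the "Think" step would be the model itself deciding which code to write, and the "Act" step would run that code in a sandbox. The sketch only shows the control flow: inspect, transform, look again.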

According to Google, this capability gives Gemini 3 Flash a 5% to 10% performance boost across various vision benchmarks.

Continue reading this post for free, courtesy of Meng Li.
