AI Disruption

Gemma 4 Delivers 3x Speedup, Day 0 Support for vLLM

Google Gemma 4 now runs 3x faster with MTP Drafter. Day 0 vLLM support, 2B–31B models, no quality loss.

Meng Li
May 07, 2026



Multi-token prediction in Gemma 4

Google made another move yesterday — after releasing Gemma 4 in early April, they’ve now shipped a “plug-in” that makes inference up to 3x faster: the MTP Drafter.

The official announcement is short but powerful:
“Same quality, way more speed.”

What is Gemma 4?

See the earlier coverage: Google Open-Sources Gemma 4, Beats 13x Larger Qwen3.5 (Meng Li, Apr 3).

Key highlights:

  • Full parameter range from 2B to 31B — from phone-friendly E2B/E4B models all the way up to workstation-grade 31B Dense and 26B MoE.

  • True multimodal — supports text, image, video, and audio.

  • Strong reasoning — reaches 85%+ on MMLU Pro, placing it in the top tier of open-source models.

  • Insane adoption — over 60 million downloads in the first 4 weeks (per Google’s own data).

No matter how strong a model is, it’s useless if it can’t run fast. Today’s update is Google’s direct attack on the “running” part.
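The “same quality, way more speed” claim is the signature of speculative decoding, the family of techniques that MTP-style drafters belong to: a cheap draft model proposes several tokens per step, and the full model verifies the whole batch at once, so the output is identical to running the full model alone. The sketch below is a toy illustration of that accept/verify loop — the two “models” are stand-in arithmetic rules, not the actual Gemma 4 or vLLM API.

```python
def full_model_next(prefix):
    """Stand-in for the expensive full model: a deterministic toy rule."""
    return (sum(prefix) + 1) % 10

def draft_next(prefix):
    """Stand-in for the cheap drafter: agrees with the full model most
    of the time, but guesses wrong when the context length is a
    multiple of 4."""
    guess = (sum(prefix) + 1) % 10
    return guess if len(prefix) % 4 else (guess + 1) % 10

def speculative_decode(prefix, n_tokens, k=3):
    """Generate n_tokens. The drafter proposes k tokens per step and the
    full model verifies them, so the output exactly matches plain greedy
    decoding with the full model alone — only faster, because the full
    model runs once per k drafted tokens instead of once per token."""
    out = list(prefix)
    full_passes = 0  # each pass verifies up to k drafted tokens
    while len(out) - len(prefix) < n_tokens:
        # 1) Drafter cheaply proposes k tokens.
        ctx, draft = list(out), []
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Full model checks the draft (real engines do this in one
        #    batched forward pass; here we simulate it token by token).
        full_passes += 1
        ctx = list(out)
        for t in draft:
            correct = full_model_next(ctx)
            if t != correct:
                out.append(correct)  # keep the verified token, drop the rest
                break
            out.append(t)
            ctx.append(t)
    return out[len(prefix):len(prefix) + n_tokens], full_passes
```

Because every accepted token was checked against the full model, quality is preserved by construction; the speedup comes entirely from how often the drafter is right. That is why a well-trained MTP drafter can claim 3x throughput with no quality loss.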

© 2026 Meng Li