AI Disruption

YOLO Releases v12: The First YOLO Framework Centered on Attention

YOLOv12 introduces the first attention-centric YOLO framework, delivering real-time object detection with improved speed, accuracy, and efficiency over previous versions.

Meng Li
Feb 22, 2025

"AI Disruption" publication New Year 30% discount link.


Structural innovation in the YOLO series has always revolved around CNNs, while the attention mechanism, the source of transformers' dominant advantage, has never been a focal point for improving the YOLO network architecture.

The main reason is that attention has been too slow to meet YOLO's real-time requirements. YOLOv12, released this Wednesday, aims to change that while achieving superior performance.

Introduction

The main reason the attention mechanism has not served as a core module in the YOLO framework is its inherent inefficiency, which stems from two factors: (1) the computational complexity of attention grows quadratically with sequence length; (2) attention's memory access patterns are inefficient (the latter is the problem FlashAttention mainly addresses). Under the same computational budget, CNN-based architectures are roughly 2-3 times faster than attention-based ones, which severely limits the use of attention in YOLO systems, where high inference speed is essential.
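
To see where both costs come from, here is a minimal sketch (not from the YOLOv12 paper, and assuming PyTorch 2.x) comparing a naive scaled dot-product attention, which materializes the full n × n score matrix, with PyTorch's fused `scaled_dot_product_attention`, which can dispatch to a FlashAttention kernel that avoids writing that matrix to slow GPU memory:

```python
import math
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # The score matrix has shape (batch, heads, n, n): both compute and
    # memory grow quadratically with sequence length n.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

# Shapes: (batch, heads, seq_len, head_dim)
q = k = v = torch.randn(1, 8, 1024, 64)

out_naive = naive_attention(q, k, v)

# Fused kernel (PyTorch 2.x). On supported GPUs this can dispatch to
# FlashAttention, which computes the same result tile by tile without
# materializing the full n x n score matrix in HBM.
out_fused = F.scaled_dot_product_attention(q, k, v)

# The two paths agree up to numerical tolerance.
print(torch.allclose(out_naive, out_fused, atol=1e-4))
```

The fused path changes memory access, not the math: the quadratic FLOP count remains, which is why attention still trails CNNs at equal compute even with FlashAttention.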
