YOLO Releases v12: The First YOLO Framework Centered on Attention
YOLOv12 is the first attention-centric YOLO framework, delivering real-time object detection with improved speed, accuracy, and efficiency and outperforming previous versions.
"AI Disruption" publication New Year 30% discount link.
Structural innovation in the YOLO series has always revolved around CNNs, while the attention mechanism, the very component that gives transformers their dominant advantage, has never been a focal point for improving the YOLO network architecture.
The main reason is that attention has been too slow to meet YOLO's real-time requirements. YOLOv12, released this Wednesday, aims to change that and achieve superior performance.
Introduction
The main reason the attention mechanism cannot serve as a core module in the YOLO framework is its inherent inefficiency, which stems from two factors: (1) the computational complexity of attention grows quadratically with the number of tokens; (2) attention's memory access patterns are inefficient (the latter is what FlashAttention mainly addresses). Under the same computational budget, CNN-based architectures are roughly 2-3 times faster than attention-based ones, which severely limits the use of attention in YOLO systems, since YOLO depends heavily on high inference speed.
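To make the scaling gap concrete, here is a minimal back-of-the-envelope sketch (not from the paper; the channel width, kernel size, and FLOP formulas are illustrative assumptions) comparing how single-head self-attention and a 3x3 convolution scale with the number of spatial tokens n:

```python
# Rough FLOP comparison: self-attention over n tokens of dimension d costs
# on the order of n^2 * d (the QK^T and attention-weighted V products),
# while a k x k convolution with c channels costs on the order of
# n * k^2 * c^2, i.e. linear in n.

def attention_flops(n: int, d: int) -> int:
    """Approximate FLOPs of single-head self-attention (QK^T plus softmax(QK^T)V)."""
    return 2 * n * n * d  # two n x n x d matrix products

def conv_flops(n: int, c: int, k: int = 3) -> int:
    """Approximate FLOPs of a k x k convolution with c input and c output channels."""
    return n * k * k * c * c

for n in (1024, 4096, 16384):  # e.g. feature maps of 32x32, 64x64, 128x128
    ratio = attention_flops(n, d=256) / conv_flops(n, c=256)
    print(f"n={n:6d}  attention/conv FLOP ratio ~ {ratio:.1f}")
```

The ratio grows linearly with n, so attention becomes increasingly expensive at the high-resolution feature maps detectors operate on, and the memory access overhead that FlashAttention targets adds latency on top of the raw FLOP count.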