MoBA Attention by Kimi Yang: DeepSeek NSA Collision & Code Release
Discover the MoBA attention mechanism, an advanced approach combining MoE and FlashAttention for efficient long-sequence processing in large language models.
"AI Disruption" publication New Year 30% discount link.
Two days ago, DeepSeek published a new paper proposing an improved attention mechanism, NSA (Native Sparse Attention). With its founder and CEO, Liang Wenfeng, personally involved as an author, the release immediately attracted widespread attention.
Refer to my article for more details.
However, on the very same day, Moonshot AI (whose Chinese name translates to "The Dark Side of the Moon") released a similar paper, and coincidentally its founder and CEO, Yang Zhilin, is also one of the co-authors.
Unlike DeepSeek, which released only a paper, Moonshot AI also published the accompanying code. That code has already been deployed and validated in production for about a year, which speaks to its effectiveness and robustness.
The paper proposes an attention mechanism called MoBA, short for Mixture of Block Attention.
MoBA is described as "an innovative approach that applies the Mixture of Experts (MoE) principle to the attention mechanism." The method follows a "less structure" principle: rather than introducing predefined biases about where to attend, it lets the model decide for itself which positions to focus on.
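To make the MoE-style routing idea concrete, below is a minimal, single-head sketch of block-gated attention in that spirit: keys and values are split into blocks, a gate scores each query against a mean-pooled summary of every block, and each query attends only within its top-k blocks. This is an illustrative assumption of mine, not the official MoBA release, which is built on FlashAttention and handles causal masking and batching; the function name, pooling choice, and parameters here are hypothetical.

```python
import torch
import torch.nn.functional as F

def block_gated_attention(q, k, v, block_size=64, top_k=2):
    """Toy block-gated attention for a single head.

    q, k, v: [seq_len, head_dim]; seq_len is assumed to be a multiple of
    block_size. Causal masking is omitted to keep the sketch short.
    """
    seq_len, head_dim = q.shape
    num_blocks = seq_len // block_size

    # Split keys/values into blocks and mean-pool the keys of each block
    # to get one representative vector per block (used only for gating).
    k_blocks = k.view(num_blocks, block_size, head_dim)
    v_blocks = v.view(num_blocks, block_size, head_dim)
    block_repr = k_blocks.mean(dim=1)                     # [num_blocks, head_dim]

    # Gate: score every query against every block representative and keep
    # the top_k highest-scoring blocks per query (the MoE-style routing step).
    gate_scores = q @ block_repr.T                        # [seq_len, num_blocks]
    top_idx = gate_scores.topk(top_k, dim=-1).indices     # [seq_len, top_k]

    out = torch.empty_like(q)
    scale = head_dim ** -0.5
    for i in range(seq_len):
        # Attend only within the blocks selected for this query.
        k_sel = k_blocks[top_idx[i]].reshape(-1, head_dim)
        v_sel = v_blocks[top_idx[i]].reshape(-1, head_dim)
        attn = F.softmax((q[i] @ k_sel.T) * scale, dim=-1)
        out[i] = attn @ v_sel
    return out

# Example: 256 tokens split into 4 blocks of 64; each query attends to 2 blocks.
q, k, v = (torch.randn(256, 64) for _ in range(3))
print(block_gated_attention(q, k, v).shape)               # torch.Size([256, 64])
```

The payoff of this routing is that each query computes attention over only top_k * block_size keys instead of the full sequence, which is where the long-sequence savings come from; the gating itself is learned from the data rather than fixed in advance.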