Muon Optimizer: 48% Less Compute Than AdamW, Compatible with DeepSeek

Discover how Dark Side of the Moon's improved Muon optimizer cuts computational requirements by 48% over AdamW, scales to larger models, and is compatible with DeepSeek.

Meng Li
Feb 23, 2025


"AI Disruption" publication New Year 30% discount link.


Kimi Drops Moonlight 16B MoE with Muon Optimizer - Install Locally

Computational requirements cut by 48% compared with AdamW: the Muon training optimizer, originally proposed by OpenAI engineers, has been further advanced by the Dark Side of the Moon (Moonshot AI) team.

The team identified the scaling law of the Muon method, improved the algorithm, and showed that Muon also works for larger models.

Across Llama-architecture models of up to 1.5B parameters, the improved Muon needs only 52% of the compute required by AdamW.
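
For context, Muon's core idea in the public reference implementation is to take the momentum-accumulated gradient of each 2-D weight matrix and approximately orthogonalize it with a few Newton-Schulz iterations before applying it. The sketch below is a minimal, simplified illustration of that step, not the Dark Side of the Moon team's exact code; the iteration coefficients follow the open-source reference implementation, and practical details such as weight decay, per-matrix update scaling, and distributed execution are omitted.

```python
# Minimal sketch of a Muon-style update for a single 2-D weight matrix.
# Assumption: based on the public reference implementation of Muon, not
# the Dark Side of the Moon team's improved version.
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix via Newton-Schulz iterations."""
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic coefficients from the reference implementation
    X = G / (G.norm() + eps)               # normalize so the iteration converges
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T                            # iterate on the "wide" orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One simplified Muon update: momentum, orthogonalize, apply."""
    momentum_buf.mul_(momentum).add_(grad)          # heavy-ball momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)
    param.add_(update, alpha=-lr)                   # apply the orthogonalized update
```

Muon is typically applied only to the hidden 2-D weight matrices, while embeddings, output heads, and 1-D parameters are still trained with an AdamW-style rule.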

Meanwhile, the team also trained a 16B MoE model based on the DeepSeek architecture, which has been open-sourced along with the improved optimization algorithm.
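
For a rough idea of what running the open-sourced checkpoint locally might look like, here is a hypothetical loading sketch using Hugging Face transformers. The repository id below is an assumption for illustration only; check the team's official release page for the actual name.

```python
# Hypothetical usage sketch: loading the open-sourced 16B MoE checkpoint with
# Hugging Face transformers. The repo id is an assumption, verify before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Moonlight-16B-A3B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

inputs = tokenizer("1 + 1 =", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```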
