Muon Optimizer: 48% Less Compute Than AdamW, Compatible with DeepSeek

Discover how Dark Side of the Moon's improved Muon optimizer cuts computational requirements by 48% over AdamW, scales to larger models, and is compatible with DeepSeek.

Meng Li
Feb 23, 2025


"AI Disruption" publication New Year 30% discount link.


Kimi Drops Moonlight 16B MoE with Muon Optimizer - Install Locally

Computational requirements cut by 48% compared with AdamW: the Muon training optimizer, originally proposed by OpenAI engineers, has been further advanced by the Dark Side of the Moon (Moonshot AI) team.

The team identified the scaling law of the Muon method, improved the algorithm, and showed that Muon also works for larger models.

Across Llama-architecture models of up to 1.5B parameters, the improved Muon needs only 52% of the compute required by AdamW.
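
For context, Muon's core idea in the public reference implementation is to take the momentum-accumulated gradient of each 2-D weight matrix and approximately orthogonalize it with a few Newton-Schulz iterations before applying it. The sketch below is a minimal, simplified illustration of that step, not the Dark Side of the Moon team's exact code; the iteration coefficients follow the open-source reference implementation, and practical details such as weight decay, per-matrix update scaling, and distributed execution are omitted.

```python
# Minimal sketch of a Muon-style update for a single 2-D weight matrix.
# Assumption: based on the public reference implementation of Muon, not
# the Dark Side of the Moon team's improved version.
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix via Newton-Schulz iterations."""
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic coefficients from the reference implementation
    X = G / (G.norm() + eps)               # normalize so the iteration converges
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T                            # iterate on the "wide" orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One simplified Muon update: momentum, orthogonalize, apply."""
    momentum_buf.mul_(momentum).add_(grad)          # heavy-ball momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)
    param.add_(update, alpha=-lr)                   # apply the orthogonalized update
```

Muon is typically applied only to the hidden 2-D weight matrices, while embeddings, output heads, and 1-D parameters are still trained with an AdamW-style rule.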

Meanwhile, the team also trained a 16B MoE model based on the DeepSeek architecture, which has been open-sourced along with the improved optimization algorithm.
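
For a rough idea of what running the open-sourced checkpoint locally might look like, here is a hypothetical loading sketch using Hugging Face transformers. The repository id below is an assumption for illustration only; check the team's official release page for the actual name.

```python
# Hypothetical usage sketch: loading the open-sourced 16B MoE checkpoint with
# Hugging Face transformers. The repo id is an assumption, verify before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Moonlight-16B-A3B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

inputs = tokenizer("1 + 1 =", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```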
