DeepSeek Releases DeepGEMM: 300 Lines of Code Accelerate V3 & R1, R2 Expected Before May
DeepSeek unveils DeepGEMM, an FP8 GEMM library that accelerates V3/R1 training and inference with roughly 300 lines of core code. The R2 model is expected before May.
"AI Disruption" publication New Year 30% discount link.
Applicable to both standard dense models and Mixture-of-Experts (MoE) models.
DeepSeek’s Open Source Week has entered its third day (see the previous two days’ coverage at the end of this article in “Related Reading”).
Today's open-source project is DeepGEMM, an FP8 GEMM library that supports both dense and Mixture-of-Experts (MoE) GEMMs. It powers training and inference for V3/R1 and reaches up to 1350+ FP8 TFLOPS on Hopper GPUs.
Specifically, DeepGEMM is a library designed for clean, efficient FP8 General Matrix Multiplication (GEMM), built around the fine-grained scaling technique introduced in DeepSeek-V3.
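As a rough illustration of what fine-grained scaling means here (a minimal PyTorch sketch, not DeepGEMM code): each small group of values gets its own FP8 scale factor instead of one scale for the whole tensor, and the GEMM kernel applies those scales during accumulation. The group size of 128 below matches the granularity described for DeepSeek-V3 activations; the helper name is made up for illustration.

```python
import torch

def quantize_fp8_per_group(x: torch.Tensor, group_size: int = 128):
    """Illustrative only: quantize a 2-D tensor to FP8 (e4m3) with one
    scale per contiguous group of `group_size` values along the last dim."""
    m, k = x.shape
    assert k % group_size == 0, "last dim must be divisible by the group size"
    groups = x.float().view(m, k // group_size, group_size)
    # Map each group's max magnitude to the FP8 e4m3 maximum (~448).
    amax = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)
    scale = amax / 448.0
    x_fp8 = (groups / scale).to(torch.float8_e4m3fn).view(m, k)
    return x_fp8, scale.squeeze(-1)  # FP8 data plus one fp32 scale per group
```

Because outlier values only distort the scale of their own small group, this per-group scheme preserves far more precision than a single per-tensor scale, which is why the GEMM kernel needs to consume the scale factors alongside the FP8 data.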
The library supports both standard GEMM and grouped GEMM for MoE. It is written in CUDA and requires no ahead-of-time compilation at install time; a lightweight Just-In-Time (JIT) module compiles all kernels at runtime.
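Here is a sketch of how such a library is called from Python once installed. The entry-point name below (`gemm_fp8_fp8_bf16_nt`) appears in the DeepGEMM repository, but treat the exact signature and the scale-tensor layout requirements as assumptions and check the project's README; the point is simply that the caller supplies FP8 operands plus their fine-grained scales, and the JIT module builds the kernel on first use.

```python
import torch
import deep_gemm  # assumed module name; kernels are JIT-compiled on first use

m, k, n = 4096, 7168, 4096
# Caller-quantized FP8 operands with their scales: per-1x128-group scales for
# the activation, per-128x128-block scales for the weight (DeepSeek-V3 style).
x_fp8 = torch.randn(m, k, device="cuda").to(torch.float8_e4m3fn)
x_scale = torch.ones(m, k // 128, device="cuda", dtype=torch.float32)
w_fp8 = torch.randn(n, k, device="cuda").to(torch.float8_e4m3fn)
w_scale = torch.ones(n // 128, k // 128, device="cuda", dtype=torch.float32)

out = torch.empty(m, n, device="cuda", dtype=torch.bfloat16)
# Assumed entry point: NT-layout FP8 GEMM with BF16 output, i.e. out = x @ w.T.
deep_gemm.gemm_fp8_fp8_bf16_nt((x_fp8, x_scale), (w_fp8, w_scale), out)
```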