Deep Dive into DeepSpeed: Enhancing Large Model Training Efficiency

Discover how to enhance large model training with DeepSpeed. Learn techniques for efficient distributed training, compression, and more.

Meng Li · Aug 02, 2024

Welcome to the "Practical Application of AI Large Language Model Systems" Series

Table of Contents (June 7, 2024)

In the lesson "Building a 100M Parameter Transformer Model from Scratch," we built a Transformer from the ground up and ran a full training pass. On a single A10 (24 GB) GPU with 500M of training text, the estimated training time was about one month. That alone shows how demanding training is on hardware, and we used only 500M of data; training a real large model involves far more data and far larger parameter counts.

As far as I know, training large models like GPT-3 and GLM-130B takes around three months. Our current approach, then, is simply not feasible.

How can we speed up training in practice?

The answer is distributed training. Widely used options include Microsoft's DeepSpeed framework and NVIDIA's NCCL communication library. This course focuses on Microsoft's DeepSpeed.
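To give a feel for what this looks like in code, here is a minimal sketch (not the course's actual training script) of handing a PyTorch model over to DeepSpeed. The tiny model, batch size, and learning rate are placeholder values chosen for illustration; the pattern to notice is that deepspeed.initialize wraps the model in an engine that takes over distributed setup, ZeRO optimizer-state sharding, and mixed-precision handling.

```python
# Minimal sketch (placeholder model and config, not the course's code):
# wrapping a PyTorch model with DeepSpeed for data-parallel training
# using ZeRO stage 2 and fp16 mixed precision.
import torch
import torch.nn as nn
import deepspeed

# Tiny stand-in model; in the course this would be the Transformer built earlier.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},           # mixed precision: less memory, faster compute
    "zero_optimization": {"stage": 2},   # shard optimizer states and gradients across GPUs
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns an engine that manages process-group setup,
# optimizer sharding, gradient synchronization, and loss scaling internally.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# One illustrative step on random data (a real run would use a dataloader).
x = torch.randn(32, 512, dtype=torch.half, device=model_engine.device)
y = torch.randn(32, 512, dtype=torch.half, device=model_engine.device)

loss = nn.functional.mse_loss(model_engine(x), y)
model_engine.backward(loss)   # replaces loss.backward()
model_engine.step()           # replaces optimizer.step() and zero_grad()
```

Launched with the DeepSpeed launcher, for example "deepspeed --num_gpus=2 train.py", the same script scales across multiple GPUs without changes to the training loop: the engine synchronizes gradients and partitions optimizer state behind the scenes.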
