Background
ZeRO-DP (ZeRO Data Parallel) and PP (Pipeline Parallelism) each provide significant memory savings across multiple GPUs. Each of these 1D techniques allows for a much more efficient utilization of the gpu memory, but on its own it's still not enough for very big models - sometimes it's not even feasible with any of the existing hardware. e.g. a model whose parameters alone take ~45GB (t5-11b: 11B params x 4 bytes each in fp32) can't fit even on a 40GB GPU.
The next stage in Model Parallelism that can enable loading bigger models onto smaller hardware is 2D Parallelism. That's combining Pipeline Parallelism (PP) with ZeRO-DP.
3D Parallelism is possible too, and it requires adding a horizontal MP (à la Megatron-LM), but we don't quite have any way to implement that yet. We need to study Megatron-LM first, so we are starting with the relatively low-hanging fruit of 2D.
Tracking
We have 3 implementations that provide the required components to build 2D Parallelism:
- DeepSpeed (DS)
- FairScale (FS)
- PyTorch (native) (PT)
The purpose of this issue is to track the feasibility/status/inter-operability of each one of them, and also which parts have been back-ported to the PyTorch core. It also tracks where transformers models stand with regards to the above 3 implementations.
The 2 main questions are:
- native 2D: how do we integrate a native PP with the same implementation's native ZeRO-DP (sharded)? (e.g. can fairscale PP work with fairscale ZeRO-DP?)
- inter-operability 2D: is there a chance one implementation's PP/ZeRO-DP could work with one or both of the others' ZeRO-DP/PP? (e.g. can fairscale PP work with DeepSpeed ZeRO-DP?)
Notes
- MPU = Model Parallel Unit - a little helper module that lets each 1D know which gpu groups it can use for PP, which for MP, and which for DP, so that one 1D doesn't interfere with another 1D. e.g. in the case of 4 gpus and PP+DP, one may want:
  - pp: dp0 = gpus [0, 1], dp1 = gpus [2, 3]
  So here there are 2 pipelines, 0-1 and 2-3, and DP sees gpus 0 and 2 as the entry points (a minimal sketch of such a group setup follows below).
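A minimal sketch of what such an MPU boils down to, using plain torch.distributed process groups (the helper name and layout below are illustrative only, not the API of any of the three implementations):

```python
# Minimal MPU-style group setup for 4 gpus: 2 pipelines of 2 stages each.
# Assumes torch.distributed has already been initialized on every rank.
import torch.distributed as dist

def build_2d_groups(world_size: int = 4, pp_size: int = 2):
    dp_size = world_size // pp_size
    pp_groups, dp_groups = [], []
    # PP groups: each replica is one pipeline -> ranks [0, 1] and [2, 3]
    for dp_rank in range(dp_size):
        ranks = list(range(dp_rank * pp_size, (dp_rank + 1) * pp_size))
        pp_groups.append(dist.new_group(ranks))
    # DP groups: the same pipeline stage across replicas -> ranks [0, 2] and [1, 3]
    for stage in range(pp_size):
        ranks = list(range(stage, world_size, pp_size))
        dp_groups.append(dist.new_group(ranks))
    return pp_groups, dp_groups
```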
TLDR
ZeRO-DP / PP inter-operability status
| | DS | FS | PT |
|---|---|---|---|
| DS | ✔️ | ❓ | ❌ |
| FS | ❓ | ❓ | ❓ |
| PT | ❌ | ❓ | ❓ |
1. DeepSpeed
1D status:
2D native status:
- ❓ native PP + ZeRO-DP - not tested yet, as it requires porting transformers to DeepSpeed's native PP first (a rough sketch of that path is below)
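For reference, here is a hedged sketch of what the DeepSpeed-native 2D path would roughly look like once a model can be expressed as a flat list of layers; `get_t5_layers`, `train_iter` and the config values are made up for illustration:

```python
# Sketch only: DeepSpeed PP (PipelineModule) + ZeRO-DP via the engine config.
# DeepSpeed documents its pipeline engine as compatible with ZeRO stage 1.
import deepspeed
from deepspeed.pipe import PipelineModule

ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-5}},
}

# get_t5_layers() is a hypothetical helper returning the model as a flat list of nn.Modules
model = PipelineModule(layers=get_t5_layers(), num_stages=2)
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for _ in range(10):
    loss = engine.train_batch(data_iter=train_iter)  # one pipelined optimizer step
```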
2D inter-operability status:
- ❌ pytorch PP + DeepSpeed ZeRO-DP. I tried using pytorch PP with DeepSpeed ZeRO-DP and couldn't figure out how to make it work: stuck in trying to combine PP with DeepSpeed deepspeedai/DeepSpeed#710
- ❓ fairscale PP + DeepSpeed ZeRO-DP (unknown)
Important components:
2. FairScale
Just started gathering information on this one - will update once I have it.
1D status:
2D native status:
- ❓ native PP + ZeRO-DP - gathering info: How to integrate 2D Parallelism: PP + ZeRO-DP? facebookresearch/fairscale#351 (a rough sketch of the combination in question is below)
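For context, here is a hedged sketch of the fairscale-native combination being asked about: fairscale.nn.Pipe for PP with fairscale.optim.OSS (sharded optimizer, i.e. ZeRO-DP stage 1) on top. Whether these two actually compose is exactly the open question of facebookresearch/fairscale#351; the toy model and values are made up:

```python
# Sketch only: fairscale Pipe (PP) + OSS sharded optimizer (ZeRO-DP stage 1).
# Assumes torch.distributed is initialized and 2 gpus are visible.
import torch
import torch.nn as nn
from fairscale.nn import Pipe
from fairscale.optim.oss import OSS

# toy stand-in for a transformer, split into 2 pipeline stages
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
model = Pipe(model, balance=[2, 1], devices=[0, 1], chunks=4)

# shard the optimizer state across the data-parallel ranks
optimizer = OSS(params=model.parameters(), optim=torch.optim.AdamW, lr=3e-5)
```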
2D inter-operability status:
- ❓ pytorch PP + fairscale ZeRO-DP gathering info
- ❓ DeepSpeed PP + fairscale ZeRO-DP gathering info
Important components:
3. PyTorch
From what I understand, pytorch has been integrating primarily the fairscale version into its core.
1D status:
- PP - experimental support. Have a PoC for t5 working: [wip] [pipeline parallel] t5 - experiment #9765 example (a minimal usage sketch of the pytorch API is below)
- ZeRO-DP - plans to implement it (primarily by integrating the fairscale implementation)
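For reference, a hedged minimal sketch of pytorch's experimental native Pipe API (torch>=1.8); the toy model, sizes and single-process RPC setup are illustrative only and not the #9765 code:

```python
# Sketch only: pytorch's experimental pipeline parallelism (torch>=1.8).
# Pipe expects an nn.Sequential whose segments are already on their devices,
# and requires the RPC framework to be initialized even on a single node.
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

rpc.init_rpc("worker", rank=0, world_size=1)  # needs MASTER_ADDR/MASTER_PORT set

stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(1024, 1024)).to("cuda:1")
model = Pipe(nn.Sequential(stage0, stage1), chunks=4)  # 4 micro-batches

out_rref = model(torch.randn(8, 1024, device="cuda:0"))  # forward returns an RRef
loss = out_rref.local_value().sum()
loss.backward()
```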
2D native status:
- ❕ native PP + ZeRO-DP (Pytorch ZeRO-DP doesn't exist yet)
2D inter-operability status:
- ❕ DeepSpeed PP + Pytorch ZeRO-DP (Pytorch ZeRO-DP doesn't exist yet)
- ❕ fairscale PP + Pytorch ZeRO-DP (Pytorch ZeRO-DP doesn't exist yet)
Important components:
- MPU: ?
Ported components:
- ZeRO-DP stage 1: ZeroRedundancyOptimizer: an implementation of a standalone sharded optimizer wrapper pytorch/pytorch#46750 (a minimal usage sketch is below)
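A hedged minimal sketch of using that ported stage-1 component; the model class and hyperparameters are placeholders:

```python
# Sketch only: ZeRO-DP stage 1 in pytorch core (torch>=1.8) - wrapping a regular
# optimizer in ZeroRedundancyOptimizer shards its state across the DDP ranks.
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")  # launched via torch.distributed.launch
model = DDP(MyModel().cuda())    # MyModel is a placeholder for any nn.Module
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,  # each rank keeps only its shard of the AdamW state
    lr=3e-5,
)
```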
Issues to track:
- The main discussion around integrating Deepspeed ZeRO into pytorch core: [RFC] DeepSpeed + PT Distributed Integration pytorch/pytorch#42849
Transformers
To make 2D Parallelism work we of course need to support all these stages in transformers, so here is the status of what is already working or a work in progress. Some components (like bart-mp) work but are unmerged, since we are still unsure how to move forward project-wide.
- ZeRO-DP
  - works across all models with the fairscale and DeepSpeed integrations
- Naive vertical MP (aka PP w/ a single stage) - see the usage sketch after this list
  - t5
  - gpt2
  - bart - unmerged [model parallelism] Bart goes parallel #9384
- Pytorch PP
  - t5 - unmerged [wip] [pipeline parallel] t5 - experiment #9765
- Horizontal MP - unresearched!
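As referenced in the naive vertical MP entry above, here is a hedged sketch of the experimental API already merged for t5/gpt2; the device_map split is made up and assumes t5-3b's 24 blocks:

```python
# Sketch only: the experimental naive vertical MP API merged for t5/gpt2.
# parallelize() spreads the transformer blocks over gpus via a device_map.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-3b")
# assuming 24 blocks: put blocks 0-11 on gpu 0 and blocks 12-23 on gpu 1
device_map = {0: list(range(0, 12)), 1: list(range(12, 24))}
model.parallelize(device_map)

# ... train/generate as usual, with inputs placed on the first device ...
model.deparallelize()  # moves everything back to cpu when done
```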