Background
ZeRO-DP (ZeRO Data Parallel) and PP (Pipeline Parallelism) each provide significant memory savings across multiple GPUs. Each of these 1D techniques allows for a much more efficient utilization of the gpu memory, but on its own it's still not enough for very big models - sometimes it's not even feasible with any of the existing hardware. e.g. a model whose parameters alone take ~45GB (t5-11b: 11B params x 4 bytes each in fp32) can't fit even on a 40GB GPU.
The next stage in Model Parallelism that can enable loading bigger models onto smaller hardware is 2D Parallelism. That's combining Pipeline Parallelism (PP) with ZeRO-DP.
3D Parallelism is possible too, and it requires adding a horizontal MP (à la Megatron-LM), but we don't quite have any way to implement that yet. We need to study Megatron-LM first, so we are starting with the relatively low-hanging fruit of 2D.
Tracking
We have 3 implementations that provide the required components to build 2D Parallelism:
- DeepSpeed (DS)
- FairScale (FS)
- PyTorch (native) (PT)
The purpose of this issue is to track the feasibility/status/inter-operability of each one of them, and also which parts have been back-ported to the PyTorch core. It also tracks where transformers models stand with regards to the above 3 implementations.
The 2 main questions are:
- native 2D: how do we integrate a native PP with the same implementation's native ZeRO-DP (sharded)? (e.g. can fairscale PP work with fairscale ZeRO-DP?)
- inter-operability 2D: is there a chance one implementation's PP/ZeRO-DP could work with one or both of the others' ZeRO-DP/PP? (e.g. can fairscale PP work with DeepSpeed ZeRO-DP?)
Notes
- MPU = Model Parallel Unit - a little helper module that lets each 1D know which gpu groups it can use for PP, which for MP, and which for DP, so that one 1D doesn't interfere with another 1D. e.g. in the case of 4 gpus and PP+DP, one may want:
  - pp: dp0 = gpus [0, 1], dp1 = gpus [2, 3]
  So here there are 2 pipelines, 0-1 and 2-3, and DP sees gpus 0 and 2 as the entry points (a minimal sketch of such a group setup follows below).
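A minimal sketch of what such an MPU boils down to, using plain torch.distributed process groups (the helper name and layout below are illustrative only, not the API of any of the three implementations):

```python
# Minimal MPU-style group setup for 4 gpus: 2 pipelines of 2 stages each.
# Assumes torch.distributed has already been initialized on every rank.
import torch.distributed as dist

def build_2d_groups(world_size: int = 4, pp_size: int = 2):
    dp_size = world_size // pp_size
    pp_groups, dp_groups = [], []
    # PP groups: each replica is one pipeline -> ranks [0, 1] and [2, 3]
    for dp_rank in range(dp_size):
        ranks = list(range(dp_rank * pp_size, (dp_rank + 1) * pp_size))
        pp_groups.append(dist.new_group(ranks))
    # DP groups: the same pipeline stage across replicas -> ranks [0, 2] and [1, 3]
    for stage in range(pp_size):
        ranks = list(range(stage, world_size, pp_size))
        dp_groups.append(dist.new_group(ranks))
    return pp_groups, dp_groups
```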
TLDR
ZeRO-DP / PP inter-operability status
| | DS | FS | PT |
|---|---|---|---|
| DS | ✔️ | ❓ | ❌ |
| FS | ❓ | ❓ | ❓ |
| PT | ❌ | ❓ | ❓ |
1. DeepSpeed
1D status:
2D native status:
- ❓ native PP + ZeRO-DP - not tested yet, as it requires porting transformers to DeepSpeed's native PP first (a rough sketch of that path is below)
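For reference, here is a hedged sketch of what the DeepSpeed-native 2D path would roughly look like once a model can be expressed as a flat list of layers; `get_t5_layers`, `train_iter` and the config values are made up for illustration:

```python
# Sketch only: DeepSpeed PP (PipelineModule) + ZeRO-DP via the engine config.
# DeepSpeed documents its pipeline engine as compatible with ZeRO stage 1.
import deepspeed
from deepspeed.pipe import PipelineModule

ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-5}},
}

# get_t5_layers() is a hypothetical helper returning the model as a flat list of nn.Modules
model = PipelineModule(layers=get_t5_layers(), num_stages=2)
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for _ in range(10):
    loss = engine.train_batch(data_iter=train_iter)  # one pipelined optimizer step
```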
2D inter-operability status:
- ❌ pytorch PP + DeepSpeed ZeRO-DP. I tried using pytorch PP with DeepSpeed ZeRO-DP and couldn't figure out how to make it work: stuck in trying to combine PP with DeepSpeed deepspeedai/DeepSpeed#710
- ❓ fairscale PP + DeepSpeed ZeRO-DP (unknown)
Important components:
2. FairScale
Just started gathering information on this one - will update once I have it.
1D status:
2D native status:
- ❓ native PP + ZeRO-DP - gathering info: How to integrate 2D Parallelism: PP + ZeRO-DP? facebookresearch/fairscale#351 (a rough sketch of the combination in question is below)
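For context, here is a hedged sketch of the fairscale-native combination being asked about: fairscale.nn.Pipe for PP with fairscale.optim.OSS (sharded optimizer, i.e. ZeRO-DP stage 1) on top. Whether these two actually compose is exactly the open question of facebookresearch/fairscale#351; the toy model and values are made up:

```python
# Sketch only: fairscale Pipe (PP) + OSS sharded optimizer (ZeRO-DP stage 1).
# Assumes torch.distributed is initialized and 2 gpus are visible.
import torch
import torch.nn as nn
from fairscale.nn import Pipe
from fairscale.optim.oss import OSS

# toy stand-in for a transformer, split into 2 pipeline stages
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
model = Pipe(model, balance=[2, 1], devices=[0, 1], chunks=4)

# shard the optimizer state across the data-parallel ranks
optimizer = OSS(params=model.parameters(), optim=torch.optim.AdamW, lr=3e-5)
```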
2D inter-operability status:
- ❓ pytorch PP + fairscale ZeRO-DP gathering info
- ❓ DeepSpeed PP + fairscale ZeRO-DP gathering info
Important components:
3. PyTorch
From what I understand, pytorch has been integrating primarily the fairscale version into its core.
1D status:
- PP - experimental support. Have a PoC for t5 working: [wip] [pipeline parallel] t5 - experiment #9765 example (a minimal usage sketch of the pytorch API is below)
- ZeRO-DP - plans to implement it (primarily by integrating the fairscale implementation)
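For reference, a hedged minimal sketch of pytorch's experimental native Pipe API (torch>=1.8); the toy model, sizes and single-process RPC setup are illustrative only and not the #9765 code:

```python
# Sketch only: pytorch's experimental pipeline parallelism (torch>=1.8).
# Pipe expects an nn.Sequential whose segments are already on their devices,
# and requires the RPC framework to be initialized even on a single node.
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

rpc.init_rpc("worker", rank=0, world_size=1)  # needs MASTER_ADDR/MASTER_PORT set

stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(1024, 1024)).to("cuda:1")
model = Pipe(nn.Sequential(stage0, stage1), chunks=4)  # 4 micro-batches

out_rref = model(torch.randn(8, 1024, device="cuda:0"))  # forward returns an RRef
loss = out_rref.local_value().sum()
loss.backward()
```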
2D native status:
- ❕ native PP + ZeRO-DP (Pytorch ZeRO-DP doesn't exist yet)
2D inter-operability status:
- ❕ DeepSpeed PP + Pytorch ZeRO-DP (Pytorch ZeRO-DP doesn't exist yet)
- ❕ fairscale PP + Pytorch ZeRO-DP (Pytorch ZeRO-DP doesn't exist yet)
Important components:
- MPU: ?
Ported components:
- ZeRO-DP stage 1: ZeroRedundancyOptimizer: an implementation of a standalone sharded optimizer wrapper pytorch/pytorch#46750 (a minimal usage sketch is below)
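A hedged minimal sketch of using that ported stage-1 component; the model class and hyperparameters are placeholders:

```python
# Sketch only: ZeRO-DP stage 1 in pytorch core (torch>=1.8) - wrapping a regular
# optimizer in ZeroRedundancyOptimizer shards its state across the DDP ranks.
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")  # launched via torch.distributed.launch
model = DDP(MyModel().cuda())    # MyModel is a placeholder for any nn.Module
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,  # each rank keeps only its shard of the AdamW state
    lr=3e-5,
)
```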
Issues to track:
- The main discussion around integrating Deepspeed ZeRO into pytorch core: [RFC] DeepSpeed + PT Distributed Integration pytorch/pytorch#42849
Transformers
To make 2D Parallelism work we of course need to support all these stages in transformers, so here is the status of what is already working or a work in progress. Some components (like bart-mp) work but are unmerged, since we are still unsure how to move forward project-wide.
- ZeRO-DP
  - works across all models with the fairscale and DeepSpeed integrations
- Naive vertical MP (aka PP w/ a single stage) - see the usage sketch after this list
  - t5
  - gpt2
  - bart - unmerged [model parallelism] Bart goes parallel #9384
- Pytorch PP
  - t5 - unmerged [wip] [pipeline parallel] t5 - experiment #9765
- Horizontal MP - unresearched!
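As referenced in the naive vertical MP entry above, here is a hedged sketch of the experimental API already merged for t5/gpt2; the device_map split is made up and assumes t5-3b's 24 blocks:

```python
# Sketch only: the experimental naive vertical MP API merged for t5/gpt2.
# parallelize() spreads the transformer blocks over gpus via a device_map.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-3b")
# assuming 24 blocks: put blocks 0-11 on gpu 0 and blocks 12-23 on gpu 1
device_map = {0: list(range(0, 12)), 1: list(range(12, 24))}
model.parallelize(device_map)

# ... train/generate as usual, with inputs placed on the first device ...
model.deparallelize()  # moves everything back to cpu when done
```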