Combining Distributed DataParallel with Distributed RPC Framework
=================================================================

**Authors**: `Pritam Damania <https://github.com/pritamdamania87>`_ and `Yi Wang <https://github.com/SciPioneer>`_

**Translator**: `dajeongPark-dev <https://github.com/dajeongPark-dev>`_


This tutorial uses a simple example to demonstrate how you can combine
`DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__ (DDP)
with the `Distributed RPC framework <https://pytorch.org/docs/master/rpc.html>`__
to combine distributed data parallelism with distributed model parallelism to
train a simple model. Source code of the example can be found `here <https://github.com/pytorch/examples/tree/master/distributed/rpc/ddp_rpc>`__.

Previous tutorials,
`Getting Started With Distributed Data Parallel <https://tutorials.pytorch.kr/intermediate/ddp_tutorial.html>`__
and `Getting Started with Distributed RPC Framework <https://tutorials.pytorch.kr/intermediate/rpc_tutorial.html>`__,
described how to perform distributed data parallel and distributed model
parallel training respectively. However, there are several training paradigms
where you might want to combine these two techniques. For example:

1) If we have a model with a sparse part (a large embedding table) and a dense
   part (FC layers), we might want to put the embedding table on a parameter
   server and replicate the FC layers across multiple trainers using `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.
   The `Distributed RPC framework <https://pytorch.org/docs/master/rpc.html>`__
   can be used to perform embedding lookups on the parameter server.
2) Enable hybrid parallelism as described in the `PipeDream <https://arxiv.org/abs/1806.03377>`__ paper.
   We can use the `Distributed RPC framework <https://pytorch.org/docs/master/rpc.html>`__
   to pipeline stages of the model across multiple workers and replicate each
   stage (if needed) using `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.

In this tutorial we will cover case 1 mentioned above. We have a total of 4
workers in our setup as follows:

1) 1 Master, which is responsible for creating an embedding table
   (nn.EmbeddingBag) on the parameter server. The master also drives the
   training loop on the two trainers.
2) 1 Parameter Server, which basically holds the embedding table in memory and
   responds to RPCs from the Master and Trainers.
3) 2 Trainers, which store an FC layer (nn.Linear) that is replicated amongst
   themselves using `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.
   The trainers are also responsible for executing the forward pass, backward
   pass and optimizer step.

The entire training process is executed as follows:

1) The master creates a `RemoteModule <https://pytorch.org/docs/master/rpc.html#remotemodule>`__
   that holds an embedding table on the Parameter Server.
2) The master then kicks off the training loop on the trainers and passes the
   remote module to the trainers.
3) The trainers create a ``HybridModel`` which first performs an embedding lookup
   using the remote module provided by the master and then executes the
   FC layer which is wrapped inside DDP.
4) The trainer executes the forward pass of the model and uses the loss to
   execute the backward pass using `Distributed Autograd <https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework>`__.
5) As part of the backward pass, the gradients for the FC layer are computed
   first and synced to all trainers via allreduce in DDP.
6) Next, Distributed Autograd propagates the gradients to the parameter server,
   where the gradients for the embedding table are updated.
7) Finally, the `Distributed Optimizer <https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim>`__ is used to update all the parameters.


.. attention::

    You should always use `Distributed Autograd <https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework>`__
    for the backward pass if you're combining DDP and RPC.


Now, let's go through each part in detail. First, we need to set up all of our
workers before we can perform any training. We create 4 processes such that
ranks 0 and 1 are our trainers, rank 2 is the master and rank 3 is the
parameter server.

We initialize the RPC framework on all 4 workers using the TCP init_method.
Once RPC initialization is done, the master creates a remote module that holds an
`EmbeddingBag <https://pytorch.org/docs/master/generated/torch.nn.EmbeddingBag.html>`__
layer on the Parameter Server using `RemoteModule <https://pytorch.org/docs/master/rpc.html#torch.distributed.nn.api.remote_module.RemoteModule>`__.
The master then loops through each trainer and kicks off the training loop by
calling ``_run_trainer`` on each trainer using `rpc_async <https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.rpc_async>`__.
Finally, the master waits for all training to finish before exiting.

The trainers first initialize a ``ProcessGroup`` for DDP with world_size=2
(for the two trainers) using `init_process_group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group>`__.
Next, they initialize the RPC framework using the TCP init_method. Note that
the ports are different in RPC initialization and ProcessGroup initialization.
This is to avoid port conflicts between the initialization of the two frameworks.
Once initialization is done, the trainers simply wait for the ``_run_trainer``
RPC from the master.

The parameter server just initializes the RPC framework and waits for RPCs from
the trainers and master.
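
The sketch below condenses this setup into one ``run_worker`` skeleton. The
worker names, port numbers, embedding dimensions, and backend choice are
illustrative assumptions; ``_run_trainer`` is the trainer entry point described
later in this tutorial, and the complete ``run_worker`` from the example is
included right below.

.. code:: python

    import torch
    import torch.distributed as dist
    import torch.distributed.rpc as rpc
    from torch.distributed.nn.api.remote_module import RemoteModule

    # Illustrative sizes; the real example defines its own constants.
    NUM_EMBEDDINGS, EMBEDDING_DIM = 100, 16

    def run_worker(rank, world_size=4):
        # RPC and the DDP ProcessGroup are initialized on different TCP ports
        # to avoid conflicts between the two frameworks.
        rpc_backend_options = rpc.TensorPipeRpcBackendOptions(
            init_method="tcp://localhost:29501"
        )

        if rank == 2:  # master
            rpc.init_rpc("master", rank=rank, world_size=world_size,
                         rpc_backend_options=rpc_backend_options)
            # Create the EmbeddingBag on the parameter server ("ps").
            remote_emb_module = RemoteModule(
                "ps",
                torch.nn.EmbeddingBag,
                args=(NUM_EMBEDDINGS, EMBEDDING_DIM),
                kwargs={"mode": "sum"},
            )
            # Kick off the training loop on both trainers and wait for them.
            futs = [
                rpc.rpc_async(f"trainer{r}", _run_trainer,
                              args=(remote_emb_module, r))
                for r in [0, 1]
            ]
            for fut in futs:
                fut.wait()
        elif rank <= 1:  # trainers (ranks 0 and 1)
            # The DDP process group contains only the two trainers.
            dist.init_process_group(
                backend="gloo", rank=rank, world_size=2,
                init_method="tcp://localhost:29500",
            )
            rpc.init_rpc(f"trainer{rank}", rank=rank, world_size=world_size,
                         rpc_backend_options=rpc_backend_options)
            # Trainers now simply wait for the master's _run_trainer RPC.
        else:  # rank 3: parameter server
            rpc.init_rpc("ps", rank=rank, world_size=world_size,
                         rpc_backend_options=rpc_backend_options)
            # The parameter server only responds to RPCs from master/trainers.

        # Block until all RPC activity on every worker has finished.
        rpc.shutdown()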

.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
   :language: py
   :start-after: BEGIN run_worker
   :end-before: END run_worker

Before we discuss details of the Trainer, let's introduce the ``HybridModel`` that
the trainer uses. As described below, the ``HybridModel`` is initialized using a
remote module that holds an embedding table (``remote_emb_module``) on the
parameter server and the ``device`` to use for DDP. The initialization of the
model wraps an `nn.Linear <https://pytorch.org/docs/master/generated/torch.nn.Linear.html>`__
layer inside DDP to replicate and synchronize this layer across all trainers.

The forward method of the model is pretty straightforward. It performs an
embedding lookup on the parameter server using RemoteModule's ``forward``
and passes its output onto the FC layer.
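
Put together, a minimal sketch of ``HybridModel`` looks like the following
(the layer sizes and the CUDA placement are illustrative assumptions; the code
included below is the reference version):

.. code:: python

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    class HybridModel(nn.Module):
        # The embedding table lives on the parameter server behind
        # remote_emb_module (a RemoteModule); the FC layer is local and
        # wrapped in DDP so it is replicated and synced across trainers.
        def __init__(self, remote_emb_module, device):
            super().__init__()
            self.remote_emb_module = remote_emb_module
            self.fc = DDP(nn.Linear(16, 8).cuda(device), device_ids=[device])
            self.device = device

        def forward(self, indices, offsets):
            # Embedding lookup runs remotely on the parameter server...
            emb_lookup = self.remote_emb_module.forward(indices, offsets)
            # ...and the dense FC layer runs locally under DDP.
            return self.fc(emb_lookup.cuda(self.device))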

.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
   :language: py
   :start-after: BEGIN hybrid_model
   :end-before: END hybrid_model

Next, let's look at the setup on the Trainer. The trainer first creates the
``HybridModel`` described above using a remote module that holds the embedding
table on the parameter server and its own rank.

Now, we need to retrieve a list of RRefs to all the parameters that we would
like to optimize with `DistributedOptimizer <https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim>`__.
To retrieve the parameters for the embedding table from the parameter server,
we can call RemoteModule's `remote_parameters <https://pytorch.org/docs/master/rpc.html#torch.distributed.nn.api.remote_module.RemoteModule.remote_parameters>`__,
which basically walks through all the parameters for the embedding table and returns
a list of RRefs. The trainer calls this method on the parameter server via RPC
to receive a list of RRefs to the desired parameters. Since the
DistributedOptimizer always takes a list of RRefs to parameters that need to
be optimized, we need to create RRefs even for the local parameters of our
FC layers. This is done by walking ``model.fc.parameters()``, creating an RRef for
each parameter and appending it to the list returned from ``remote_parameters()``.
Note that we cannot use ``model.parameters()``,
because it would recursively call ``model.remote_emb_module.parameters()``,
which is not supported by ``RemoteModule``.

Finally, we create our DistributedOptimizer using all the RRefs and define a
CrossEntropyLoss function.
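
As a condensed sketch (the optimizer choice, learning rate, and the
``_setup_trainer`` helper name are illustrative assumptions; the included code
below is the reference version):

.. code:: python

    import torch
    import torch.distributed.rpc as rpc
    from torch.distributed.optim import DistributedOptimizer

    def _setup_trainer(remote_emb_module, rank):
        model = HybridModel(remote_emb_module, rank)

        # RRefs to the embedding-table parameters held on the parameter server.
        model_parameter_rrefs = model.remote_emb_module.remote_parameters()

        # DistributedOptimizer needs RRefs for *all* parameters, so wrap the
        # local FC parameters in RRefs too (model.parameters() cannot be used
        # because of the RemoteModule limitation described above).
        for param in model.fc.parameters():
            model_parameter_rrefs.append(rpc.RRef(param))

        opt = DistributedOptimizer(torch.optim.SGD, model_parameter_rrefs, lr=0.05)
        criterion = torch.nn.CrossEntropyLoss()
        return model, opt, criterion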

.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
   :language: py
   :start-after: BEGIN setup_trainer
   :end-before: END setup_trainer

Now we're ready to introduce the main training loop that is run on each trainer.
``get_next_batch`` is just a helper function to generate random inputs and
targets for training. We run the training loop for multiple epochs and for each
batch:

1) Set up a `Distributed Autograd Context <https://pytorch.org/docs/master/rpc.html#torch.distributed.autograd.context>`__
   for Distributed Autograd.
2) Run the forward pass of the model and retrieve its output.
3) Compute the loss based on our outputs and targets using the loss function.
4) Use Distributed Autograd to execute a distributed backward pass using the loss.
5) Finally, run a Distributed Optimizer step to optimize all the parameters.
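
The loop below is a minimal sketch of these five steps (the epoch count and the
``_setup_trainer`` helper are assumptions carried over from the earlier sketch;
the included code that follows is the reference implementation):

.. code:: python

    import torch.distributed.autograd as dist_autograd

    def _run_trainer(remote_emb_module, rank):
        model, opt, criterion = _setup_trainer(remote_emb_module, rank)

        for epoch in range(10):  # epoch count is arbitrary here
            for indices, offsets, target in get_next_batch(rank):
                # 1) Each iteration gets its own distributed autograd context.
                with dist_autograd.context() as context_id:
                    # 2) Forward pass: remote embedding lookup + local DDP FC layer.
                    output = model(indices, offsets)
                    # 3) Compute the loss from outputs and targets.
                    loss = criterion(output, target)
                    # 4) Distributed backward pass across DDP and RPC.
                    dist_autograd.backward(context_id, [loss])
                    # 5) Distributed optimizer step; gradients live in the
                    #    per-iteration context, so no zero_grad() is needed.
                    opt.step(context_id)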

.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
   :language: py
   :start-after: BEGIN run_trainer
   :end-before: END run_trainer

Source code for the entire example can be found `here <https://github.com/pytorch/examples/tree/master/distributed/rpc/ddp_rpc>`__.