different learning rate or changing a network layer size can
dramatically impact model performance.

- Fortunately, there are tools that help with finding the best combination
- of parameters. `Ray Tune <https://docs.ray.io/en/latest/tune.html>`__ is
- an industry standard tool for distributed hyperparameter tuning. Ray
- Tune includes the latest hyperparameter search algorithms, integrates
- with various analysis libraries, and natively supports distributed
- training through `Ray’s distributed machine learning
- engine <https://ray.io/>`__.
+ This page shows how to integrate `Ray
+ Tune <https://docs.ray.io/en/latest/tune.html>`__ into your PyTorch
+ training workflow for distributed hyperparameter tuning. It extends the
+ PyTorch tutorial for training a CIFAR10 image classifier in the `CIFAR10
+ tutorial (PyTorch
+ documentation) <https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html>`__.

- In this tutorial, we will show you how to integrate Ray Tune into your
- PyTorch training workflow. We will extend `this tutorial from the
- PyTorch
- documentation <https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html>`__
- for training a CIFAR10 image classifier.
+ Only minor modifications are needed. Specifically, this example wraps
+ data loading and training in functions, makes some network parameters
+ configurable, adds optional checkpointing, and defines the search space
+ for model tuning.

- We only need to make minor modifications:
+ To run this tutorial, install the following prerequisites:

- 1. wrap data loading and training in functions,
- 2. make some network parameters configurable,
- 3. add checkpointing (optional),
- 4. define the search space for the model tuning
+ - ``ray[tune]`` – Distributed hyperparameter tuning library
+ - ``torchvision`` – Data transforms for computer vision datasets

- To run this tutorial, please make sure the following packages are
- installed:
-
- - ``ray[tune]``: Distributed hyperparameter tuning library
- - ``torchvision``: For the data transformers
-
- Setup / Imports
- ---------------
+ Setup and imports
+ -----------------

Let’s start with the imports:

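The import block that follows in the file is unchanged by this commit and therefore not shown in the diff. For orientation, it looks roughly like the sketch below; the exact home of ``Checkpoint`` depends on your Ray version, so treat the Ray imports as assumptions rather than the verbatim tutorial code.

.. code-block:: python

    from functools import partial
    import os
    import tempfile

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    import torchvision
    import torchvision.transforms as transforms
    from torch.utils.data import random_split

    from ray import tune
    from ray.train import Checkpoint  # newer Ray releases also expose ray.tune.Checkpoint
    from ray.tune.schedulers import ASHAScheduler
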
@@ -86,8 +76,8 @@ def load_data(data_dir="./data"):
# Configurable neural network
# ---------------------------
#
- # We can only tune parameters that are configurable. In this example, we
- # specify the layer sizes of the fully connected layers:
+ # In this example, we specify the layer sizes of the fully connected
+ # layers.

class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
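The remainder of the class is unchanged and elided by the diff. It follows the architecture from the CIFAR10 tutorial, with only the two fully connected layer sizes exposed as arguments; a rough sketch:

.. code-block:: python

    class Net(nn.Module):
        def __init__(self, l1=120, l2=84):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 6, 5)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(6, 16, 5)
            # l1 and l2 are the tunable sizes of the fully connected layers.
            self.fc1 = nn.Linear(16 * 5 * 5, l1)
            self.fc2 = nn.Linear(l1, l2)
            self.fc3 = nn.Linear(l2, 10)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = torch.flatten(x, 1)  # flatten all dimensions except the batch
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x
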
@@ -109,24 +99,23 @@ def forward(self, x):
        return x

######################################################################
- # The train function
- # ------------------
+ # Train function
+ # --------------
#
# Now it gets interesting, because we introduce some changes to the
- # example `from the PyTorch
- # documentation <https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html>`__.
+ # example from the `CIFAR10 tutorial (PyTorch
+ # documentation) <https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html>`__.
#
# We wrap the training script in a function
# ``train_cifar(config, data_dir=None)``. The ``config`` parameter
# receives the hyperparameters we want to train with. The ``data_dir``
# specifies the directory where we load and store the data, allowing
# multiple runs to share the same data source. This is especially useful
- # in cluster environments where you can mount a shared storage (e.g. NFS)
- # to this directory, preventing the data from being downloaded to each
- # node separately. We also load the model and optimizer state at the start
- # of the run if a checkpoint is provided. Further down in this tutorial,
- # you will find information on how to save the checkpoint and what it is
- # used for.
+ # in cluster environments where you can mount shared storage (for example
+ # NFS), preventing the data from being downloaded to each node separately.
+ # We also load the model and optimizer state at the start of the run if a
+ # checkpoint is provided. Further down in this tutorial, you will find
+ # information on how to save the checkpoint and what it is used for.
#
# .. code-block:: python
#
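The function body belonging to this code block is elided by the diff. A minimal sketch of how the start of ``train_cifar`` could use ``config`` and restore a checkpoint is shown below; the ``checkpoint.pt`` filename and the state dict keys are illustrative, not the tutorial’s exact implementation.

.. code-block:: python

    def train_cifar(config, data_dir=None):
        net = Net(config["l1"], config["l2"])
        # (Device selection and multi-GPU handling are shown in the next section.)

        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

        # Resume model and optimizer state if Ray Tune hands us a checkpoint.
        start_epoch = 0
        checkpoint = tune.get_checkpoint()
        if checkpoint:
            with checkpoint.as_directory() as checkpoint_dir:
                state = torch.load(os.path.join(checkpoint_dir, "checkpoint.pt"))
            net.load_state_dict(state["net_state_dict"])
            optimizer.load_state_dict(state["optimizer_state_dict"])
            start_epoch = state["epoch"]

        trainset, _ = load_data(data_dir)
        # ... split off a validation set, build DataLoaders with config["batch_size"],
        # and run the training loop starting at start_epoch ...
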
@@ -158,9 +147,9 @@ def forward(self, x):
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Image classification benefits largely from GPUs. Luckily, we can
- # continue to use PyTorch’s abstractions in Ray Tune. Thus, we can wrap
- # our model in ``nn.DataParallel`` to support data parallel training on
- # multiple GPUs:
+ # continue to use PyTorch’s tools in Ray Tune. Thus, we can wrap our model
+ # in ``nn.DataParallel`` to support data parallel training on multiple
+ # GPUs:
#
# .. code-block:: python
#
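The elided snippet boils down to something like this sketch:

.. code-block:: python

    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            # Replicate the model across all GPUs visible to this trial.
            net = nn.DataParallel(net)
    net.to(device)
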
@@ -185,7 +174,7 @@ def forward(self, x):
# GPUs. Notably, Ray also supports `fractional
# GPUs <https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html#fractional-accelerators>`__
# so we can share GPUs among trials, as long as the model still fits on
- # the GPU memory. We’ll come back to that later.
+ # the GPU memory. We will return to that later.
#
# Communicating with Ray Tune
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -225,15 +214,11 @@ def forward(self, x):
# enabling us to pause and resume training.
#
# To summarize, integrating Ray Tune into your PyTorch training requires
- # just a few key additions:
- #
- # - ``tune.report()`` to report metrics (and optionally checkpoints) to
- #   Ray Tune
- # - ``tune.get_checkpoint()`` to load a model from a checkpoint
- # - ``Checkpoint.from_directory()`` to create a checkpoint object from
- #   saved state
- #
- # The rest of your training code remains standard PyTorch!
+ # just a few key additions: use ``tune.report()`` to report metrics (and
+ # optionally checkpoints) to Ray Tune, ``tune.get_checkpoint()`` to load a
+ # model from a checkpoint, and ``Checkpoint.from_directory()`` to create a
+ # checkpoint object from saved state. The rest of your training code
+ # remains standard PyTorch!
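As a sketch of how these pieces fit together at the end of a training epoch (assuming a recent Ray version where ``tune.report()`` takes a metrics dict and an optional checkpoint; the metric names and the ``checkpoint.pt`` filename are illustrative):

.. code-block:: python

    # Save the current state and report validation metrics to Ray Tune.
    with tempfile.TemporaryDirectory() as tmp_dir:
        torch.save(
            {
                "epoch": epoch,
                "net_state_dict": net.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
            },
            os.path.join(tmp_dir, "checkpoint.pt"),
        )
        checkpoint = Checkpoint.from_directory(tmp_dir)
        tune.report(
            {"loss": val_loss / val_steps, "accuracy": correct / total},
            checkpoint=checkpoint,
        )
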
#
# Full training function
# ~~~~~~~~~~~~~~~~~~~~~~
@@ -351,7 +336,7 @@ def train_cifar(config, data_dir=None):
# -----------------
#
# Commonly the performance of a machine learning model is tested on a
- # hold-out test set with data that has not been used for training the
+ # held-out test set with data that has not been used for training the
# model. We also wrap this in a function:

def test_accuracy(net, device="cpu"):
@@ -375,11 +360,11 @@ def test_accuracy(net, device="cpu"):
    return correct / total
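The elided body of ``test_accuracy`` is the standard evaluation loop; roughly (assuming ``load_data`` returns the train and test sets):

.. code-block:: python

    def test_accuracy(net, device="cpu"):
        _, testset = load_data()
        testloader = torch.utils.data.DataLoader(
            testset, batch_size=4, shuffle=False, num_workers=2
        )

        correct = 0
        total = 0
        with torch.no_grad():
            for images, labels in testloader:
                images, labels = images.to(device), labels.to(device)
                outputs = net(images)
                # The class with the highest logit is the prediction.
                _, predicted = torch.max(outputs, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        return correct / total
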

######################################################################
- # The function also expects a ``device`` parameter, so we can do the test
+ # The function also expects a ``device`` parameter so we can do the test
# set validation on a GPU.
#
- # Configuring the search space
- # ----------------------------
+ # Search space configuration
+ # --------------------------
#
# Lastly, we need to define Ray Tune’s search space. Here is an example:
#
@@ -394,10 +379,9 @@ def test_accuracy(net, device="cpu"):
#
# The ``tune.choice()`` accepts a list of values that are uniformly
# sampled from. In this example, the ``l1`` and ``l2`` parameters should
- # be powers of 2 between 4 and 256, so either 4, 8, 16, 32, 64, 128, or
- # 256. The ``lr`` (learning rate) should be uniformly sampled between
- # 0.0001 and 0.1. Lastly, the batch size is a choice between 2, 4, 8, and
- # 16.
+ # be powers of 2 between 1 and 256: 1, 2, 4, 8, 16, 32, 64, 128, or 256.
+ # The ``lr`` (learning rate) should be uniformly sampled between 0.0001
+ # and 0.1. Lastly, the batch size is a choice between 2, 4, 8, and 16.
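A search space matching this description could look like the following sketch (the tutorial’s actual definition is elided above; a log-uniform distribution is the usual choice for sampling a learning rate across several orders of magnitude):

.. code-block:: python

    config = {
        "l1": tune.choice([2 ** i for i in range(9)]),  # 1, 2, 4, ..., 256
        "l2": tune.choice([2 ** i for i in range(9)]),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16]),
    }
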
#
# For each trial, Ray Tune samples a combination of parameters from these
# search spaces according to the search space configuration and search
@@ -439,13 +423,13 @@ def test_accuracy(net, device="cpu"):
#     )
#     results = tuner.fit()
#
- # You can specify the number of CPUs, which are then available e.g. to
+ # Specify the number of CPUs, which are then available, for example to
# increase the ``num_workers`` of the PyTorch ``DataLoader`` instances.
# The selected number of GPUs are made visible to PyTorch in each trial.
- # Trials do not have access to GPUs that haven’t been requested, so you
+ # Trials do not have access to GPUs that have not been requested, so you
# don’t need to worry about resource contention.
#
- # You can also specify fractional GPUs (e.g., ``gpus_per_trial=0.5``),
+ # You can specify fractional GPUs (for example, ``gpus_per_trial=0.5``),
# which allows trials to share a GPU. Just ensure that the models fit
# within the GPU memory.
#
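Per-trial resources are declared when the trainable is wrapped for the ``Tuner``; a sketch of what that could look like (``gpus_per_trial=0.5`` would let two trials share one GPU, and the ``ASHAScheduler`` arguments here are illustrative):

.. code-block:: python

    trainable = tune.with_resources(
        tune.with_parameters(train_cifar, data_dir=data_dir),
        resources={"cpu": 2, "gpu": gpus_per_trial},
    )
    tuner = tune.Tuner(
        trainable,
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="min",
            scheduler=ASHAScheduler(max_t=max_num_epochs, grace_period=1, reduction_factor=2),
            num_samples=num_trials,
        ),
        param_space=config,
    )
    results = tuner.fit()
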
@@ -519,7 +503,7 @@ def main(num_trials=10, max_num_epochs=10, gpus_per_trial=2):
    main(num_trials=1, max_num_epochs=1, gpus_per_trial=0)

######################################################################
- # If you run the code, an example output could look like this:
+ # Your output will look something like this:
#
# .. code-block:: bash
#
@@ -548,4 +532,4 @@ def main(num_trials=10, max_num_epochs=10, gpus_per_trial=2):
# performing trial achieved a validation accuracy of approximately 47%,
# which could be confirmed on the test set.
#
- # So that’s it! You can now tune the parameters of your PyTorch models.
+ # You can now tune the parameters of your PyTorch models.