
Commit 1f11769

Author: Ricardo Decal
Commit message: polish
Parent: 64bc12e


beginner_source/hyperparameter_tuning_tutorial.py

Lines changed: 47 additions & 63 deletions
@@ -7,35 +7,25 @@
 different learning rate or changing a network layer size can
 dramatically impact model performance.
 
-Fortunately, there are tools that help with finding the best combination
-of parameters. `Ray Tune <https://docs.ray.io/en/latest/tune.html>`__ is
-an industry standard tool for distributed hyperparameter tuning. Ray
-Tune includes the latest hyperparameter search algorithms, integrates
-with various analysis libraries, and natively supports distributed
-training through `Ray’s distributed machine learning
-engine <https://ray.io/>`__.
+This page shows how to integrate `Ray
+Tune <https://docs.ray.io/en/latest/tune.html>`__ into your PyTorch
+training workflow for distributed hyperparameter tuning. It extends the
+PyTorch tutorial for training a CIFAR10 image classifier in the `CIFAR10
+tutorial (PyTorch
+documentation) <https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html>`__.
 
-In this tutorial, we will show you how to integrate Ray Tune into your
-PyTorch training workflow. We will extend `this tutorial from the
-PyTorch
-documentation <https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html>`__
-for training a CIFAR10 image classifier.
+Only minor modifications are needed. Specifically, this example wraps
+data loading and training in functions, makes some network parameters
+configurable, adds optional checkpointing, and defines the search space
+for model tuning.
 
-We only need to make minor modifications:
+To run this tutorial, install the following prerequisites:
 
-1. wrap data loading and training in functions,
-2. make some network parameters configurable,
-3. add checkpointing (optional),
-4. define the search space for the model tuning
+- ``ray[tune]`` – Distributed hyperparameter tuning library
+- ``torchvision`` – Data transforms for computer vision datasets
 
-To run this tutorial, please make sure the following packages are
-installed:
-
-- ``ray[tune]``: Distributed hyperparameter tuning library
-- ``torchvision``: For the data transformers
-
-Setup / Imports
----------------
+Setup and imports
+-----------------
 
 Let’s start with the imports:
 
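The import block itself is unchanged by this commit and therefore does not appear in the hunk above. For orientation, the imports the "Setup and imports" section refers to typically look something like the sketch below; the exact Ray import paths vary across Ray versions, so treat them as assumptions rather than the file's literal code.

.. code-block:: python

    # Sketch only: typical imports for this tutorial (not taken from the diff).
    from functools import partial
    import os
    import tempfile

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    from torch.utils.data import random_split
    import torchvision
    import torchvision.transforms as transforms

    from ray import tune
    from ray.tune.schedulers import ASHAScheduler  # import path assumed; check your Ray version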
@@ -86,8 +76,8 @@ def load_data(data_dir="./data"):
 # Configurable neural network
 # ---------------------------
 #
-# We can only tune parameters that are configurable. In this example, we
-# specify the layer sizes of the fully connected layers:
+# In this example, we specify the layer sizes of the fully connected
+# layers.
 
 class Net(nn.Module):
     def __init__(self, l1=120, l2=84):
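The rest of ``Net`` is unchanged and lies outside this hunk. As a rough sketch, a configurable version of the CIFAR10 network from the blitz tutorial threads ``l1`` and ``l2`` through the fully connected layers like this (illustrative, not the file's literal code):

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Net(nn.Module):
        def __init__(self, l1=120, l2=84):
            super().__init__()
            # Convolutional feature extractor from the CIFAR10 blitz tutorial.
            self.conv1 = nn.Conv2d(3, 6, 5)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(6, 16, 5)
            # Only the fully connected layer sizes are tunable.
            self.fc1 = nn.Linear(16 * 5 * 5, l1)
            self.fc2 = nn.Linear(l1, l2)
            self.fc3 = nn.Linear(l2, 10)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = torch.flatten(x, 1)  # flatten all dimensions except the batch
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            return self.fc3(x)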
@@ -109,24 +99,23 @@ def forward(self, x):
         return x
 
 ######################################################################
-# The train function
-# ------------------
+# Train function
+# --------------
 #
 # Now it gets interesting, because we introduce some changes to the
-# example `from the PyTorch
-# documentation <https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html>`__.
+# example from the `CIFAR10 tutorial (PyTorch
+# documentation) <https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html>`__.
 #
 # We wrap the training script in a function
 # ``train_cifar(config, data_dir=None)``. The ``config`` parameter
 # receives the hyperparameters we want to train with. The ``data_dir``
 # specifies the directory where we load and store the data, allowing
 # multiple runs to share the same data source. This is especially useful
-# in cluster environments where you can mount a shared storage (e.g. NFS)
-# to this directory, preventing the data from being downloaded to each
-# node separately. We also load the model and optimizer state at the start
-# of the run if a checkpoint is provided. Further down in this tutorial,
-# you will find information on how to save the checkpoint and what it is
-# used for.
+# in cluster environments where you can mount shared storage (for example
+# NFS), preventing the data from being downloaded to each node separately.
+# We also load the model and optimizer state at the start of the run if a
+# checkpoint is provided. Further down in this tutorial, you will find
+# information on how to save the checkpoint and what it is used for.
 #
 # .. code-block:: python
 #
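The code block referenced at the end of this hunk is unchanged and elided from the diff. A rough skeleton of how the opening of such a ``train_cifar`` function can look is sketched below; the checkpoint calls follow the newer ``ray.tune`` API mentioned later in the file, and names such as the ``checkpoint.pt`` filename are illustrative assumptions, not the tutorial's exact code.

.. code-block:: python

    # Sketch of the start of train_cifar(config, data_dir=None); illustrative only.
    def train_cifar(config, data_dir=None):
        net = Net(config["l1"], config["l2"])
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        net.to(device)

        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

        # Resume model and optimizer state if Ray Tune provides a checkpoint
        # (API location assumed; check your Ray version).
        checkpoint = tune.get_checkpoint()
        if checkpoint:
            with checkpoint.as_directory() as checkpoint_dir:
                state = torch.load(os.path.join(checkpoint_dir, "checkpoint.pt"))
            net.load_state_dict(state["model_state"])
            optimizer.load_state_dict(state["optimizer_state"])

        trainset, _ = load_data(data_dir)
        # ... training loop continues as in the full tutorial ...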
@@ -158,9 +147,9 @@ def forward(self, x):
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 #
 # Image classification benefits largely from GPUs. Luckily, we can
-# continue to use PyTorch’s abstractions in Ray Tune. Thus, we can wrap
-# our model in ``nn.DataParallel`` to support data parallel training on
-# multiple GPUs:
+# continue to use PyTorch’s tools in Ray Tune. Thus, we can wrap our model
+# in ``nn.DataParallel`` to support data parallel training on multiple
+# GPUs:
 #
 # .. code-block:: python
 #
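The ``nn.DataParallel`` snippet itself is unchanged context and not shown in the hunk. The usual pattern, sketched under the assumption of the standard device-selection code used in the tutorial, is:

.. code-block:: python

    # Pick a device and wrap the model for multi-GPU data parallelism if available.
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)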
@@ -185,7 +174,7 @@ def forward(self, x):
 # GPUs. Notably, Ray also supports `fractional
 # GPUs <https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html#fractional-accelerators>`__
 # so we can share GPUs among trials, as long as the model still fits on
-# the GPU memory. We’ll come back to that later.
+# the GPU memory. We will return to that later.
 #
 # Communicating with Ray Tune
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -225,15 +214,11 @@ def forward(self, x):
 # enabling us to pause and resume training.
 #
 # To summarize, integrating Ray Tune into your PyTorch training requires
-# just a few key additions:
-#
-# - ``tune.report()`` to report metrics (and optionally checkpoints) to
-#   Ray Tune
-# - ``tune.get_checkpoint()`` to load a model from a checkpoint
-# - ``Checkpoint.from_directory()`` to create a checkpoint object from
-#   saved state
-#
-# The rest of your training code remains standard PyTorch!
+# just a few key additions: use ``tune.report()`` to report metrics (and
+# optionally checkpoints) to Ray Tune, ``tune.get_checkpoint()`` to load a
+# model from a checkpoint, and ``Checkpoint.from_directory()`` to create a
+# checkpoint object from saved state. The rest of your training code
+# remains standard PyTorch!
 #
 # Full training function
 # ~~~~~~~~~~~~~~~~~~~~~~
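As a concrete illustration of those three calls, an end-of-epoch report with an optional checkpoint can look roughly like the sketch below. The import location of ``Checkpoint`` and the ``checkpoint.pt`` filename are assumptions based on recent Ray releases, and ``net``, ``optimizer``, ``val_loss``, and ``val_acc`` stand in for values from the surrounding training loop.

.. code-block:: python

    import os
    import tempfile

    import torch
    from ray import tune
    from ray.tune import Checkpoint  # location may differ across Ray versions

    # Hypothetical end-of-epoch reporting; not the tutorial's literal code.
    with tempfile.TemporaryDirectory() as tmp_dir:
        torch.save(
            {"model_state": net.state_dict(), "optimizer_state": optimizer.state_dict()},
            os.path.join(tmp_dir, "checkpoint.pt"),
        )
        checkpoint = Checkpoint.from_directory(tmp_dir)
        tune.report({"loss": val_loss, "accuracy": val_acc}, checkpoint=checkpoint)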
@@ -351,7 +336,7 @@ def train_cifar(config, data_dir=None):
 # -----------------
 #
 # Commonly the performance of a machine learning model is tested on a
-# hold-out test set with data that has not been used for training the
+# held-out test set with data that has not been used for training the
 # model. We also wrap this in a function:
 
 def test_accuracy(net, device="cpu"):
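The body of ``test_accuracy`` is unchanged and mostly outside this hunk. A held-out evaluation loop of this shape typically looks like the following sketch; ``load_data`` returning a test set and the batch size of 4 are assumptions.

.. code-block:: python

    def test_accuracy(net, device="cpu"):
        # Sketch of a held-out evaluation loop; not the file's literal code.
        _, testset = load_data()
        testloader = torch.utils.data.DataLoader(
            testset, batch_size=4, shuffle=False, num_workers=2
        )

        correct = 0
        total = 0
        net.eval()
        with torch.no_grad():
            for images, labels in testloader:
                images, labels = images.to(device), labels.to(device)
                outputs = net(images)
                _, predicted = torch.max(outputs, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        return correct / total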
@@ -375,11 +360,11 @@ def test_accuracy(net, device="cpu"):
     return correct / total
 
 ######################################################################
-# The function also expects a ``device`` parameter, so we can do the test
+# The function also expects a ``device`` parameter so we can do the test
 # set validation on a GPU.
 #
-# Configuring the search space
-# ----------------------------
+# Search space configuration
+# --------------------------
 #
 # Lastly, we need to define Ray Tune’s search space. Here is an example:
 #
@@ -394,10 +379,9 @@ def test_accuracy(net, device="cpu"):
 #
 # The ``tune.choice()`` accepts a list of values that are uniformly
 # sampled from. In this example, the ``l1`` and ``l2`` parameters should
-# be powers of 2 between 4 and 256, so either 4, 8, 16, 32, 64, 128, or
-# 256. The ``lr`` (learning rate) should be uniformly sampled between
-# 0.0001 and 0.1. Lastly, the batch size is a choice between 2, 4, 8, and
-# 16.
+# be powers of 2 between 1 and 256: 1, 2, 4, 8, 16, 32, 64, 128, or 256.
+# The ``lr`` (learning rate) should be uniformly sampled between 0.0001
+# and 0.1. Lastly, the batch size is a choice between 2, 4, 8, and 16.
 #
 # For each trial, Ray Tune samples a combination of parameters from these
 # search spaces according to the search space configuration and search
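Put together, a search space matching this description can be written as the sketch below. The tutorial's own ``config`` block is not part of this hunk, and the use of ``tune.loguniform`` for the learning rate is an assumption (it samples log-uniformly over the range the text mentions).

.. code-block:: python

    config = {
        "l1": tune.choice([2 ** i for i in range(9)]),   # 1, 2, 4, ..., 256
        "l2": tune.choice([2 ** i for i in range(9)]),   # 1, 2, 4, ..., 256
        "lr": tune.loguniform(1e-4, 1e-1),               # sampled between 0.0001 and 0.1
        "batch_size": tune.choice([2, 4, 8, 16]),
    }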
@@ -439,13 +423,13 @@ def test_accuracy(net, device="cpu"):
 # )
 # results = tuner.fit()
 #
-# You can specify the number of CPUs, which are then available e.g. to
+# Specify the number of CPUs, which are then available, for example to
 # increase the ``num_workers`` of the PyTorch ``DataLoader`` instances.
 # The selected number of GPUs are made visible to PyTorch in each trial.
-# Trials do not have access to GPUs that haven’t been requested, so you
+# Trials do not have access to GPUs that have not been requested, so you
 # don’t need to worry about resource contention.
 #
-# You can also specify fractional GPUs (e.g., ``gpus_per_trial=0.5``),
+# You can specify fractional GPUs (for example, ``gpus_per_trial=0.5``),
 # which allows trials to share a GPU. Just ensure that the models fit
 # within the GPU memory.
 #
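Only the tail of the ``Tuner`` setup appears as context in this hunk. A rough sketch of what the full block can look like, assuming an ``ASHAScheduler`` and the ``data_dir``, ``num_trials``, ``max_num_epochs``, and ``gpus_per_trial`` names from the surrounding ``main`` function, is:

.. code-block:: python

    from functools import partial

    # Early-stopping scheduler; parameter values here are illustrative.
    scheduler = ASHAScheduler(max_t=max_num_epochs, grace_period=1, reduction_factor=2)

    tuner = tune.Tuner(
        tune.with_resources(
            partial(train_cifar, data_dir=data_dir),
            resources={"cpu": 2, "gpu": gpus_per_trial},  # per-trial resource request
        ),
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="min",
            scheduler=scheduler,
            num_samples=num_trials,
        ),
        param_space=config,
    )
    results = tuner.fit()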
@@ -519,7 +503,7 @@ def main(num_trials=10, max_num_epochs=10, gpus_per_trial=2):
 main(num_trials=1, max_num_epochs=1, gpus_per_trial=0)
 
 ######################################################################
-# If you run the code, an example output could look like this:
+# Your output will look something like this:
 #
 # .. code-block:: bash
 #
@@ -548,4 +532,4 @@ def main(num_trials=10, max_num_epochs=10, gpus_per_trial=2):
 # performing trial achieved a validation accuracy of approximately 47%,
 # which could be confirmed on the test set.
 #
-# So that’s it! You can now tune the parameters of your PyTorch models.
+# You can now tune the parameters of your PyTorch models.
