Fix #3186 - create leaf/non-leaf/requires_grad/retain_grad tutorial #3492

Open · wants to merge 1 commit into main

339 changes: 339 additions & 0 deletions beginner_source/understanding_leaf_vs_nonleaf_tutorial.py
@@ -0,0 +1,339 @@
"""
Understanding requires_grad, retain_grad, Leaf, and Non-leaf Tensors
====================================================================

**Author:** `Justin Silver <https://github.com/j-silv>`__

This tutorial explains the subtleties of ``requires_grad``,
``retain_grad``, leaf, and non-leaf tensors using a simple example.

Before starting, make sure you understand `tensors and how to manipulate
them <https://docs.pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html>`__.
A basic knowledge of `how autograd
works <https://docs.pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html>`__
would also be useful.

"""


######################################################################
# Setup
# -----
#
# First, make sure `PyTorch is
# installed <https://pytorch.org/get-started/locally/>`__ and then import
# the necessary libraries.
#

import torch
import torch.nn.functional as F


######################################################################
# Next, we instantiate a simple network so that we can focus on the
# gradients. It consists of an affine layer, followed by a ReLU
# activation, and ends with an MSE loss between the prediction and label
# tensors.
#
# .. math::
#
# \mathbf{y}_{\text{pred}} = \text{ReLU}(\mathbf{x} \mathbf{W} + \mathbf{b})
#
# .. math::
#
# L = \text{MSE}(\mathbf{y}_{\text{pred}}, \mathbf{y})
#
# Note that ``requires_grad=True`` is necessary for the parameters
# (``W`` and ``b``) so that PyTorch tracks operations involving those
# tensors. We’ll discuss this further in a later
# `section <#requires-grad>`__.
#

# tensor setup
x = torch.ones(1, 3) # input with shape: (1, 3)
W = torch.ones(3, 2, requires_grad=True) # weights with shape: (3, 2)
b = torch.ones(1, 2, requires_grad=True) # bias with shape: (1, 2)
y = torch.ones(1, 2) # output with shape: (1, 2)

# forward pass
z = (x @ W) + b # pre-activation with shape: (1, 2)
y_pred = F.relu(z) # activation with shape: (1, 2)
loss = F.mse_loss(y_pred, y) # scalar loss


######################################################################
# Leaf vs. non-leaf tensors
# -------------------------
#
# After running the forward pass, PyTorch autograd has built up a `dynamic
# computational
# graph <https://docs.pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#computational-graph>`__
# which is shown below. This is a `Directed Acyclic Graph
# (DAG) <https://en.wikipedia.org/wiki/Directed_acyclic_graph>`__ which
# keeps a record of input tensors (leaf nodes), all subsequent operations
# on those tensors, and the intermediate/output tensors (non-leaf nodes).
# The graph is used to compute gradients for each tensor starting from the
# graph roots (outputs) to the leaves (inputs) using the `chain
# rule <https://en.wikipedia.org/wiki/Chain_rule>`__ from calculus:
#
# .. math::
#
# \mathbf{y} = \mathbf{f}_k\bigl(\mathbf{f}_{k-1}(\dots \mathbf{f}_1(\mathbf{x}) \dots)\bigr)
#
# .. math::
#
# \frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
# \frac{\partial \mathbf{f}_k}{\partial \mathbf{f}_{k-1}} \cdot
# \frac{\partial \mathbf{f}_{k-1}}{\partial \mathbf{f}_{k-2}} \cdot
# \cdots \cdot
# \frac{\partial \mathbf{f}_1}{\partial \mathbf{x}}
#
# .. figure:: /_static/img/understanding_leaf_vs_nonleaf/comp-graph-1.png
# :alt: Computational graph after forward pass
#
# Computational graph after forward pass
#
# PyTorch considers a node to be a *leaf* if it is not the result of a
# tensor operation with at least one input having ``requires_grad=True``
# (e.g. ``x``, ``W``, ``b``, and ``y``), and everything else to be
# *non-leaf* (e.g. ``z``, ``y_pred``, and ``loss``). You can verify this
# programmatically by probing the ``is_leaf`` attribute of the tensors:
#

# prints True because new tensors are leaves by convention
print(f"{x.is_leaf=}")

# prints False because tensor is the result of an operation with at
# least one input having requires_grad=True
print(f"{z.is_leaf=}")
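

######################################################################
# As a quick sanity check (a minimal, non-executed sketch reusing the
# tensors defined above), the same attribute can be probed for every
# tensor in the graph:
#
# ::
#
#    print([t.is_leaf for t in (x, W, b, y)])       # [True, True, True, True]
#    print([t.is_leaf for t in (z, y_pred, loss)])  # [False, False, False]
#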


######################################################################
# The distinction between leaf and non-leaf determines whether the
# tensor’s gradient will be stored in the ``grad`` property after the
# backward pass, and thus be usable for `gradient
# descent <https://en.wikipedia.org/wiki/Gradient_descent>`__. We’ll cover
# this in more detail in the `retain_grad section <#retain-grad>`__.
#
# Let’s now investigate how PyTorch calculates and stores gradients for
# the tensors in its computational graph.
#


######################################################################
# ``requires_grad``
# -----------------
#
# To build the computational graph used for gradient calculation, we need
# to pass ``requires_grad=True`` to the tensor constructor. By default,
# the value is ``False``, so PyTorch does not track gradients on newly
# created tensors. To verify this, try creating ``W`` and ``b`` without
# setting ``requires_grad``, re-run the forward pass, and then run
# backpropagation. You will see:
#
# ::
#
# >>> loss.backward()
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
#
# This error means that autograd can’t backpropagate to any leaf tensors
# because ``loss`` is not tracking gradients. If you need to change this
# property on an existing tensor, you can call ``requires_grad_()`` on it
# (notice the trailing ``_``, which marks an in-place operation).
#
# We can sanity check which nodes require gradient calculation, just like
# we did above with the ``is_leaf`` attribute:
#

print(f"{x.requires_grad=}") # prints False because requires_grad=False by default
print(f"{W.requires_grad=}") # prints True because we set requires_grad=True in constructor
print(f"{z.requires_grad=}") # prints True because tensor is a non-leaf node
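

######################################################################
# As a brief illustration (a non-executed sketch, so the tutorial’s state
# is left untouched), ``requires_grad_()`` toggles the flag in place on a
# leaf tensor such as ``x``:
#
# ::
#
#    x.requires_grad_(True)    # autograd will now track operations on x
#    print(x.requires_grad)    # True
#    x.requires_grad_(False)   # restore the default used in this tutorial
#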


######################################################################
# It’s useful to remember that a non-leaf tensor always has
# ``requires_grad=True``; backpropagation through it would fail otherwise.
# A leaf tensor, on the other hand, only has ``requires_grad=True`` if the
# user explicitly set it. Put another way, if at least one of the inputs
# to an operation requires a gradient, then its output will require a
# gradient as well.
#
# There are two exceptions to this rule:
#
# 1. Any ``nn.Module`` that has ``nn.Parameter`` will have
# ``requires_grad=True`` for its parameters (see
# `here <https://docs.pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html#creating-models>`__)
# 2. Locally disabling gradient computation with context managers (see
# `here <https://docs.pytorch.org/docs/stable/notes/autograd.html#locally-disabling-gradient-computation>`__)
#
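# As a brief illustration of the second exception (a hypothetical,
# non-executed snippet reusing the tensors defined above), an operation
# performed under ``torch.no_grad()`` produces a tensor that does not
# require gradients, and is therefore a leaf, even though its inputs do:
#
# ::
#
#    with torch.no_grad():
#        z_no_grad = (x @ W) + b    # hypothetical name, for illustration
#
#    print(z_no_grad.requires_grad)  # False
#    print(z_no_grad.is_leaf)        # True
#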
# In summary, ``requires_grad`` tells autograd which tensors need to have
# their gradients calculated for backpropagation to work. This is
# different from which tensors have their ``grad`` field populated, which
# is the topic of the next section.
#


######################################################################
# ``retain_grad``
# ---------------
#
# To actually perform optimization (e.g. with SGD or Adam), we need to run
# the backward pass so that we can extract the gradients.
#

loss.backward()


######################################################################
# Calling ``backward()`` populates the ``grad`` field of all leaf tensors
# which had ``requires_grad=True``. The ``grad`` attribute holds the
# gradient of the loss with respect to the tensor we are probing. Before
# running ``backward()``, this attribute is set to ``None``.
#

print(f"{W.grad=}")
print(f"{b.grad=}")
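

######################################################################
# For this toy example we can check the numbers by hand (a quick
# sanity-check derivation, assuming the default ``reduction='mean'`` of
# ``mse_loss``). Every element of ``z`` equals 4, so the ReLU passes the
# values through unchanged (its derivative is 1 there) and, with
# :math:`N = 2` output elements,
#
# .. math::
#
#    \frac{\partial L}{\partial \mathbf{z}}
#    = \frac{2}{N}\,(\mathbf{y}_{\text{pred}} - \mathbf{y})
#    = \begin{bmatrix} 3 & 3 \end{bmatrix}
#
# ``b.grad`` is therefore filled with 3s, and ``W.grad`` (the outer
# product of ``x`` with the row vector above) is a 3x2 tensor filled with
# 3s, matching the printout above.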


######################################################################
# You might be wondering about the other tensors in our network. Let’s
# check the remaining leaf nodes:
#

# prints all None because requires_grad=False
print(f"{x.grad=}")
print(f"{y.grad=}")


######################################################################
# The gradients for these tensors haven’t been populated because we did
# not explicitly tell PyTorch to calculate their gradient
# (``requires_grad=False``).
#
# Let’s now look at an intermediate non-leaf node:
#

print(f"{z.grad=}")


######################################################################
# PyTorch returns ``None`` for the gradient and also warns us that a
# non-leaf node’s ``grad`` attribute is being accessed. Although autograd
# has to calculate intermediate gradients for backpropagation to work, it
# assumes you don’t need to access the values afterwards. To change this
# behavior, we can use the ``retain_grad()`` function on a tensor. This
# tells the autograd engine to populate that tensor’s ``grad`` after
# calling ``backward()``.
#

# we have to re-run the forward pass
z = (x @ W) + b
y_pred = F.relu(z)
loss = F.mse_loss(y_pred, y)

# tell PyTorch to store the gradients after backward()
z.retain_grad()
y_pred.retain_grad()
loss.retain_grad()

# have to zero out gradients otherwise they would accumulate
W.grad = None
b.grad = None

# backpropagation
loss.backward()

# print gradients for all tensors that have requires_grad=True
print(f"{W.grad=}")
print(f"{b.grad=}")
print(f"{z.grad=}")
print(f"{y_pred.grad=}")
print(f"{loss.grad=}")


######################################################################
# We get the same result for ``W.grad`` as before. Also note that because
# the loss is a scalar, the gradient of the loss with respect to itself is
# simply ``1.0``.
#
# If we look at the state of the computational graph now, we see that the
# ``retains_grad`` attribute has changed for the intermediate tensors. By
# convention, this attribute will print ``False`` for any leaf node, even
# one with ``requires_grad=True``.
#
# .. figure:: /_static/img/understanding_leaf_vs_nonleaf/comp-graph-2.png
# :alt: Computational graph after backward pass
#
# Computational graph after backward pass
#
# If you call ``retain_grad()`` on a leaf node that already requires its
# gradient, it results in a no-op, since leaf tensors populate ``grad``
# anyway. If we call ``retain_grad()`` on a node that has
# ``requires_grad=False``, PyTorch actually throws an error, since it
# can’t store a gradient that is never calculated.
#
# ::
#
# >>> x.retain_grad()
# RuntimeError: can't retain_grad on Tensor that has requires_grad=False
#
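# Conversely (a hypothetical, non-executed snippet), calling
# ``retain_grad()`` on a leaf tensor that already requires gradients, such
# as ``W``, succeeds but changes nothing, since leaf tensors populate
# ``grad`` regardless and report ``retains_grad=False`` by convention:
#
# ::
#
#    W.retain_grad()          # allowed, but a no-op for a leaf tensor
#    print(W.retains_grad)    # False
#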


######################################################################
# Summary table
# -------------
#
# Using ``retain_grad()`` and ``retains_grad`` only makes sense for
# non-leaf nodes, since the ``grad`` attribute will already be populated
# for leaf tensors that have ``requires_grad=True``. By default, these
# non-leaf nodes do not retain (store) their gradient after
# backpropagation. We can change that by rerunning the forward pass,
# telling PyTorch to store the gradients, and then performing
# backpropagation.
#
# The following table summarizes the discussion above and can be used as
# a reference. The scenarios listed below are the only valid combinations
# for PyTorch tensors.
#
#
#
# +-------------+-------------------+------------------+-------------------------------------------------+-----------------------------------+
# | ``is_leaf`` | ``requires_grad`` | ``retains_grad`` | ``requires_grad_()``                            | ``retain_grad()``                 |
# +=============+===================+==================+=================================================+===================================+
# | ``True``    | ``False``         | ``False``        | sets ``requires_grad`` to ``True`` or ``False`` | raises ``RuntimeError``           |
# +-------------+-------------------+------------------+-------------------------------------------------+-----------------------------------+
# | ``True``    | ``True``          | ``False``        | sets ``requires_grad`` to ``True`` or ``False`` | no-op                             |
# +-------------+-------------------+------------------+-------------------------------------------------+-----------------------------------+
# | ``False``   | ``True``          | ``False``        | no-op                                           | sets ``retains_grad`` to ``True`` |
# +-------------+-------------------+------------------+-------------------------------------------------+-----------------------------------+
# | ``False``   | ``True``          | ``True``         | no-op                                           | no-op                             |
# +-------------+-------------------+------------------+-------------------------------------------------+-----------------------------------+
#
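

######################################################################
# As a compact check of the table (a hypothetical, non-executed snippet
# reusing the tensors in their current state), each tensor falls into
# exactly one row:
#
# ::
#
#    for name, t in [("x", x), ("W", W), ("z", z), ("loss", loss)]:
#        print(name, t.is_leaf, t.requires_grad, t.retains_grad)
#
#    # prints (one line per tensor):
#    #   x True False False
#    #   W True True False
#    #   z False True True
#    #   loss False True True
#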


######################################################################
# Conclusion
# ----------
#
# In this tutorial, we covered when and how PyTorch computes gradients for
# leaf and non-leaf tensors. By using ``retain_grad``, we can access the
# gradients of intermediate tensors within autograd’s computational graph.
#
# If you would like to learn more about how PyTorch’s autograd system
# works, please visit the `references <#references>`__ below. If you have
# any feedback for this tutorial (improvements, typo fixes, etc.) then
# please use the `PyTorch Forums <https://discuss.pytorch.org/>`__ and/or
# the `issue tracker <https://github.com/pytorch/tutorials/issues>`__ to
# reach out.
#


######################################################################
# References
# ----------
#
# - `A Gentle Introduction to
# torch.autograd <https://docs.pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html>`__
# - `Automatic Differentiation with
# torch.autograd <https://docs.pytorch.org/tutorials/beginner/basics/autogradqs_tutorial>`__
# - `Autograd
# mechanics <https://docs.pytorch.org/docs/stable/notes/autograd.html>`__
#
7 changes: 7 additions & 0 deletions index.rst
@@ -99,6 +99,13 @@ Welcome to PyTorch Tutorials
:link: intermediate/pinmem_nonblock.html
:tags: Getting-Started

.. customcarditem::
:header: Understanding requires_grad, retain_grad, Leaf, and Non-leaf Tensors
:card_description: Learn the subtleties of requires_grad, retain_grad, leaf, and non-leaf tensors
:image: _static/img/thumbnails/cropped/understanding_leaf_vs_nonleaf.png
:link: beginner/understanding_leaf_vs_nonleaf_tutorial.html
:tags: Getting-Started

.. Image/Video

.. customcarditem::