
Commit 5036908

Update doc

Differential Revision: D79030332
Pull Request resolved: #854

1 parent: ac33906

2 files changed: +15, -7 lines


docs/source/getting_started/parallelism.rst

Lines changed: 2 additions & 2 deletions
@@ -76,6 +76,7 @@ method.
 For loading raw byte strings into array format, SPDL offers efficient
 functions through :py:mod:`spdl.io` module.
 
+.. _pipeline-parallelism-custom-mt:
 
 Multi-threading (custom)
 ------------------------
@@ -100,8 +101,7 @@ instance, or put it in a
 The following example shows how to initialize and store a CUDA stream
 in a thread-local storage.
 
-.. admonition::
-   :class: note
+.. note::
 
    The following code is now available as :py:func:`spdl.io.transfer_tensor`.
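For reference, the thread-local CUDA stream pattern that this documentation section walks through (and that the note above says is now packaged as spdl.io.transfer_tensor) looks roughly like the sketch below. It is not the library code; the helper names _get_stream and transfer are placeholders:

import threading

import torch

# One slot per thread, so each background worker gets its own CUDA stream.
_THREAD_LOCAL = threading.local()


def _get_stream() -> torch.cuda.Stream:
    # Lazily create and cache a stream for the calling thread.
    if not hasattr(_THREAD_LOCAL, "stream"):
        _THREAD_LOCAL.stream = torch.cuda.Stream()
    return _THREAD_LOCAL.stream


def transfer(batch: torch.Tensor) -> torch.Tensor:
    stream = _get_stream()
    with torch.cuda.stream(stream):
        # Pin the host memory, then copy asynchronously on the side stream.
        batch = batch.pin_memory().to("cuda", non_blocking=True)
    # Wait for the copy so the tensor is safe to use on the default stream.
    stream.synchronize()
    return batch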

src/spdl/io/_transfer.py

Lines changed: 13 additions & 5 deletions
@@ -116,13 +116,21 @@ def _get_trancfer_func() -> _DataTransfer:
 
 
 def transfer_tensor(batch: T, /) -> T:
-    """Transfers PyTorch CPU Tensors to CUDA in a background.
+    """Transfers PyTorch CPU Tensors to CUDA in a dedicated stream.
 
     This function wraps calls to :py:meth:`torch.Tensor.pin_memory` and
-    :py:meth:`torch.Tensor.to`, and execute them in a dedicated CUDA stream,
-    so that when called in a background thread, data transfer is carried out
-    in a way it overlaps with the GPU computation happening in the foreground
-    thread (such as training and inference).
+    :py:meth:`torch.Tensor.to`, and execute them in a dedicated CUDA stream.
+
+    When called in a background thread, the data transfer overlaps with
+    the GPU computation happening in the foreground thread (such as training
+    and inference).
+
+    .. seealso::
+
+       :ref:`pipeline-parallelism-custom-mt` - An intended way to use
+       this function in :py:class:`~spdl.pipeline.Pipeline`.
+
+    .. image:: ../../_static/data/parallelism_transfer.png
 
     Concretely, it performs the following operations.
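As a rough illustration of the overlap the updated docstring describes, a single background thread can run spdl.io.transfer_tensor on the next batch while the main thread runs the GPU work on the current one. This sketch is not part of the commit; run, train_step, and the loader iterable are placeholders, and the seealso above points to the Pipeline-based setup as the intended usage:

from concurrent.futures import ThreadPoolExecutor

import spdl.io
import torch


def train_step(batch: torch.Tensor) -> None:
    # Placeholder for the foreground GPU work (training / inference).
    batch.sum()


def run(loader) -> None:
    # A single background thread calls spdl.io.transfer_tensor, so the
    # pinning and host-to-device copy run on a dedicated CUDA stream and
    # overlap with the GPU work issued by the main thread.
    # (Assumes the loader yields at least one CPU batch.)
    with ThreadPoolExecutor(max_workers=1) as executor:
        it = iter(loader)
        future = executor.submit(spdl.io.transfer_tensor, next(it))
        for cpu_batch in it:
            gpu_batch = future.result()
            # Kick off the next transfer before running the current step.
            future = executor.submit(spdl.io.transfer_tensor, cpu_batch)
            train_step(gpu_batch)
        train_step(future.result())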
