Commit f6a8161

Molly Xu authored
Add performance tips tutorial (#1065)
Co-authored-by: Molly Xu <[email protected]>
1 parent 7581c01 commit f6a8161

File tree

4 files changed: +241 −22 lines changed

docs/source/conf.py

Lines changed: 1 addition & 0 deletions
@@ -82,6 +82,7 @@ def __call__(self, filename):
         "approximate_mode.py",
         "sampling.py",
         "parallel_decoding.py",
+        "performance_tips.py",
         "custom_frame_mappings.py",
     ]
 else:

docs/source/index.rst

Lines changed: 8 additions & 0 deletions
@@ -84,6 +84,14 @@ Decoding

      How to sample regular and random clips from a video

+   .. grid-item-card:: :octicon:`file-code;1em`
+      Performance Tips
+      :img-top: _static/img/card-background.svg
+      :link: generated_examples/decoding/performance_tips.html
+      :link-type: url
+
+      Tips for optimizing video decoding performance
+

 Encoding
 ^^^^^^^^

examples/decoding/basic_cuda_example.py

Lines changed: 19 additions & 22 deletions
@@ -18,28 +18,6 @@
 running the transform steps. Encoded packets are often much smaller than decoded frames so
 CUDA decoding also uses less PCI-e bandwidth.

-When to and when not to use CUDA Decoding
------------------------------------------
-
-CUDA Decoding can offer speed-up over CPU Decoding in a few scenarios:
-
-#. You are decoding a large resolution video
-#. You are decoding a large batch of videos that's saturating the CPU
-#. You want to do whole-image transforms like scaling or convolutions on the decoded tensors
-   after decoding
-#. Your CPU is saturated and you want to free it up for other work
-
-
-Here are situations where CUDA Decoding may not make sense:
-
-#. You want bit-exact results compared to CPU Decoding
-#. You have small resolution videos and the PCI-e transfer latency is large
-#. Your GPU is already busy and CPU is not
-
-It's best to experiment with CUDA Decoding to see if it improves your use-case. With
-TorchCodec you can simply pass in a device parameter to the
-:class:`~torchcodec.decoders.VideoDecoder` class to use CUDA Decoding.
-
 Installing TorchCodec with CUDA Enabled
 ---------------------------------------

@@ -113,6 +91,25 @@
 print(frame.data.device)


+# %%
+# Checking for CPU Fallback
+# -------------------------------------
+#
+# In some cases, CUDA decoding may fall back to CPU decoding. This can happen
+# when the video codec or format is not supported by the NVDEC hardware decoder, or when NVCUVID wasn't found.
+# TorchCodec provides the :class:`~torchcodec.decoders.CpuFallbackStatus` class
+# to help you detect when this fallback occurs.
+#
+# You can access the fallback status via the
+# :attr:`~torchcodec.decoders.VideoDecoder.cpu_fallback` attribute:
+
+with set_cuda_backend("beta"):
+    decoder = VideoDecoder(video_file, device="cuda")
+
+# Check and print the CPU fallback status
+print(decoder.cpu_fallback)
+
+
 # %%
 # Visualizing Frames
 # -------------------------------------
examples/decoding/performance_tips.py

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""
.. meta::
   :description: Learn how to optimize TorchCodec video decoding performance with batch APIs, approximate seeking, multi-threading, and CUDA acceleration.

==============================================
TorchCodec Performance Tips and Best Practices
==============================================

This tutorial consolidates performance optimization techniques for video
decoding with TorchCodec. Learn when and how to apply various strategies
to increase performance.
"""


# %%
# Overview
# --------
#
# When decoding videos with TorchCodec, several techniques can significantly
# improve performance depending on your use case. This guide covers:
#
# 1. **Batch APIs** - Decode multiple frames at once
# 2. **Approximate Mode & Keyframe Mappings** - Trade accuracy for speed
# 3. **Multi-threading** - Parallelize decoding across videos or chunks
# 4. **CUDA Acceleration** - Use GPU decoding for supported formats
#
# We'll explore each technique and when to use it.

# %%
# 1. Use Batch APIs When Possible
# --------------------------------
#
# If you need to decode multiple frames at once, the batch methods are faster
# than calling single-frame decoding methods multiple times. For example,
# :meth:`~torchcodec.decoders.VideoDecoder.get_frames_at` is faster than calling
# :meth:`~torchcodec.decoders.VideoDecoder.get_frame_at` multiple times.
# TorchCodec's batch APIs reduce overhead and can leverage internal
# optimizations; a short sketch follows the note below.
#
# **Key Methods:**
#
# For index-based frame retrieval:
#
# - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_at` for specific indices
# - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_in_range` for ranges
#
# For timestamp-based frame retrieval:
#
# - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_played_at` for timestamps
# - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_played_in_range` for time ranges
#

# %%
# **When to use:**
#
# - Decoding multiple frames

# %%
# .. note::
#
#     For complete examples with runnable code demonstrating batch decoding,
#     iteration, and frame retrieval, see
#     :ref:`sphx_glr_generated_examples_decoding_basic_example.py`.
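
# %%
# A minimal sketch of batch decoding (``video.mp4`` is a hypothetical local
# file used only for illustration):
#
# .. code-block:: python
#
#     from torchcodec.decoders import VideoDecoder
#
#     decoder = VideoDecoder("video.mp4")  # hypothetical path
#
#     # One call decodes all three frames; typically faster than three
#     # separate get_frame_at() calls.
#     frames = decoder.get_frames_at(indices=[0, 50, 100])
#     print(frames.data.shape)  # batched uint8 tensor of decoded frames
#
#     # Timestamp-based equivalent:
#     frames = decoder.get_frames_played_at(seconds=[0.0, 1.5, 3.0])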

# %%
# 2. Approximate Mode & Keyframe Mappings
# ----------------------------------------
#
# By default, TorchCodec uses ``seek_mode="exact"``, which performs a :term:`scan` when
# you create the decoder to build an accurate internal index of frames. This
# ensures frame-accurate seeking but takes longer for decoder initialization,
# especially on long videos.

# %%
# **Approximate Mode**
# ~~~~~~~~~~~~~~~~~~~~
#
# Setting ``seek_mode="approximate"`` skips the initial :term:`scan` and relies on the
# video file's metadata headers. This dramatically speeds up
# :class:`~torchcodec.decoders.VideoDecoder` creation, particularly for long
# videos, but may result in slightly less accurate seeking in some cases.
#
# **Which mode should you use?**
#
# - If you care about frame-accurate seeking, use ``"exact"``.
# - If the video is long and you're only decoding a small number of frames,
#   approximate mode should be faster.
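
# %%
# A minimal sketch of the two modes (``long_video.mp4`` is a hypothetical file
# used only for illustration):
#
# .. code-block:: python
#
#     from torchcodec.decoders import VideoDecoder
#
#     # Default: scans the file up front for frame-accurate seeking.
#     exact_decoder = VideoDecoder("long_video.mp4", seek_mode="exact")
#
#     # Skips the scan and relies on metadata headers: much faster decoder
#     # creation, at the cost of potentially less accurate seeking.
#     approx_decoder = VideoDecoder("long_video.mp4", seek_mode="approximate")
#
#     frame = approx_decoder.get_frame_played_at(seconds=10.0)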

# %%
# **Custom Frame Mappings**
# ~~~~~~~~~~~~~~~~~~~~~~~~~
#
# For advanced use cases, you can pre-compute a custom mapping between desired
# frame indices and actual keyframe locations. This allows you to speed up
# :class:`~torchcodec.decoders.VideoDecoder` instantiation while maintaining the
# frame-seeking accuracy of ``seek_mode="exact"``.
#
# **When to use:**
#
# - Frame accuracy is critical, so you cannot use approximate mode
# - You can preprocess videos once and then decode them many times
#
# **Performance impact:** speeds up decoder instantiation, similarly to
# ``seek_mode="approximate"``.

# %%
# .. note::
#
#     For complete benchmarks showing actual speedup numbers, accuracy comparisons,
#     and implementation examples, see
#     :ref:`sphx_glr_generated_examples_decoding_approximate_mode.py`
#     and :ref:`sphx_glr_generated_examples_decoding_custom_frame_mappings.py`.

# %%
# 3. Multi-threading for Parallel Decoding
# -----------------------------------------
#
# When decoding multiple videos, or a large number of frames from a single
# video, a few parallelization strategies can speed up the decoding process:
#
# - **FFmpeg-based parallelism** - Use FFmpeg's internal threading for
#   intra-frame parallelism, where parallelization happens within individual
#   frames rather than across frames. For that, use the ``num_ffmpeg_threads``
#   parameter of the :class:`~torchcodec.decoders.VideoDecoder`.
# - **Multiprocessing** - Distribute work across multiple processes.
# - **Multithreading** - Use multiple threads within a single process.
#
# You can use both multiprocessing and multithreading to decode multiple videos
# in parallel, or to decode a single long video in parallel by splitting it into
# chunks; a thread-pool sketch follows the note below.

# %%
# .. note::
#
#     For complete examples comparing sequential, FFmpeg-based parallelism,
#     multi-process, and multi-threaded approaches, see
#     :ref:`sphx_glr_generated_examples_decoding_parallel_decoding.py`.
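
# %%
# A minimal multi-threading sketch that decodes several videos in parallel (the
# file names are hypothetical; the decoding work runs in C++/FFmpeg, so a thread
# pool can often keep several cores busy):
#
# .. code-block:: python
#
#     from concurrent.futures import ThreadPoolExecutor
#
#     from torchcodec.decoders import VideoDecoder
#
#     paths = ["a.mp4", "b.mp4", "c.mp4"]  # hypothetical files
#
#     def decode_first_frames(path):
#         # Limit FFmpeg's own threading to avoid oversubscription when many
#         # decoders run at once.
#         decoder = VideoDecoder(path, num_ffmpeg_threads=1)
#         return decoder.get_frames_in_range(start=0, stop=10).data
#
#     with ThreadPoolExecutor(max_workers=3) as pool:
#         batches = list(pool.map(decode_first_frames, paths))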

# %%
# 4. CUDA Acceleration
# --------------------
#
# TorchCodec supports GPU-accelerated decoding using NVIDIA's hardware decoder
# (NVDEC) on supported hardware. This keeps decoded tensors in GPU memory,
# avoiding expensive CPU-GPU transfers for downstream GPU operations.

# %%
# **Recommended: use the beta CUDA interface**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# We recommend the new "beta" CUDA interface, which is significantly faster than
# the previous one and supports the same features:
#
# .. code-block:: python
#
#     with set_cuda_backend("beta"):
#         decoder = VideoDecoder("file.mp4", device="cuda")
#
# %%
# **When to use:**
#
# - Decoding large-resolution videos
# - Decoding a large batch of videos that saturates the CPU
#
# **When NOT to use:**
#
# - You need bit-exact results compared to CPU decoding
# - The videos are small-resolution and PCI-e transfer latency dominates
# - The GPU is already busy and the CPU is not
#
# **Performance impact:** CUDA decoding can significantly outperform CPU decoding,
# especially for high-resolution videos and when decoding many frames.
# Actual speedup varies by hardware, resolution, and codec.
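
# %%
# A minimal sketch of keeping decoded frames on the GPU for a downstream
# transform (``video.mp4`` is a hypothetical file; this assumes a CUDA-enabled
# TorchCodec build and a codec supported by NVDEC):
#
# .. code-block:: python
#
#     import torch
#
#     from torchcodec.decoders import VideoDecoder
#
#     decoder = VideoDecoder("video.mp4", device="cuda")
#     frames = decoder.get_frames_at(indices=[0, 10, 20])
#
#     # The frames are already on the GPU, so this resize runs there too,
#     # with no CPU round-trip.
#     resized = torch.nn.functional.interpolate(
#         frames.data.float(), size=(224, 224), mode="bilinear"
#     )
#     print(resized.device)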

# %%
# **Checking for CPU Fallback**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# In some cases, CUDA decoding may silently fall back to CPU decoding when the
# video codec or format is not supported by NVDEC. You can detect this using
# the :attr:`~torchcodec.decoders.VideoDecoder.cpu_fallback` attribute:
#
# .. code-block:: python
#
#     with set_cuda_backend("beta"):
#         decoder = VideoDecoder("file.mp4", device="cuda")
#
#     # Print detailed fallback status
#     print(decoder.cpu_fallback)
#
# .. note::
#
#     The timing of when you can detect CPU fallback differs between backends:
#     with the **FFmpeg backend**, you can only check the fallback status after
#     decoding at least one frame, because FFmpeg determines codec support lazily
#     during decoding; with the **BETA backend**, you can check the fallback
#     status immediately after decoder creation, because that backend checks
#     codec support upfront.
#
# For installation instructions, detailed examples, and visual comparisons
# between CPU and CUDA decoding, see
# :ref:`sphx_glr_generated_examples_decoding_basic_cuda_example.py`.

# %%
# Conclusion
# ----------
#
# TorchCodec offers multiple performance optimization strategies, each suited to
# different scenarios. Use batch APIs for multi-frame decoding, approximate mode
# for faster initialization, parallel processing for high throughput, and CUDA
# acceleration to offload the CPU.
#
# The best results often come from combining techniques. Profile your specific
# use case and apply optimizations incrementally, using the benchmarks in the
# linked examples as a guide.
#
# For more information, see:
#
# - :ref:`sphx_glr_generated_examples_decoding_basic_example.py` - Basic decoding examples
# - :ref:`sphx_glr_generated_examples_decoding_approximate_mode.py` - Approximate mode benchmarks
# - :ref:`sphx_glr_generated_examples_decoding_custom_frame_mappings.py` - Custom frame mappings
# - :ref:`sphx_glr_generated_examples_decoding_parallel_decoding.py` - Parallel decoding strategies
# - :ref:`sphx_glr_generated_examples_decoding_basic_cuda_example.py` - CUDA acceleration guide
# - :class:`torchcodec.decoders.VideoDecoder` - Full API reference
