|
| 1 | +# Copyright (c) Meta Platforms, Inc. and affiliates. |
| 2 | +# All rights reserved. |
| 3 | +# |
| 4 | +# This source code is licensed under the BSD-style license found in the |
| 5 | +# LICENSE file in the root directory of this source tree. |
| 6 | + |
| 7 | +""" |
| 8 | +.. meta:: |
| 9 | + :description: Learn how to optimize TorchCodec video decoding performance with batch APIs, approximate seeking, multi-threading, and CUDA acceleration. |
| 10 | + |
| 11 | +============================================== |
| 12 | +TorchCodec Performance Tips and Best Practices |
| 13 | +============================================== |
| 14 | + |
| 15 | +This tutorial consolidates performance optimization techniques for video |
| 16 | +decoding with TorchCodec. Learn when and how to apply various strategies |
| 17 | +to increase performance. |
| 18 | +""" |
| 19 | + |
| 20 | + |
| 21 | +# %% |
| 22 | +# Overview |
| 23 | +# -------- |
| 24 | +# |
| 25 | +# When decoding videos with TorchCodec, several techniques can significantly |
| 26 | +# improve performance depending on your use case. This guide covers: |
| 27 | +# |
| 28 | +# 1. **Batch APIs** - Decode multiple frames at once |
| 29 | +# 2. **Approximate Mode & Keyframe Mappings** - Trade accuracy for speed |
| 30 | +# 3. **Multi-threading** - Parallelize decoding across videos or chunks |
| 31 | +# 4. **CUDA Acceleration** - Use GPU decoding for supported formats |
| 32 | +# |
| 33 | +# We'll explore each technique and when to use it. |
| 34 | + |
| 35 | +# %% |
| 36 | +# 1. Use Batch APIs When Possible |
| 37 | +# -------------------------------- |
| 38 | +# |
| 39 | +# If you need to decode multiple frames at once, the batch methods are faster than calling single-frame decoding methods multiple times. |
| 40 | +# For example, :meth:`~torchcodec.decoders.VideoDecoder.get_frames_at` is faster than calling :meth:`~torchcodec.decoders.VideoDecoder.get_frame_at` multiple times. |
| 41 | +# TorchCodec's batch APIs reduce overhead and can leverage internal optimizations. |
| 42 | +# |
| 43 | +# **Key Methods:** |
| 44 | +# |
| 45 | +# For index-based frame retrieval: |
| 46 | +# |
| 47 | +# - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_at` for specific indices |
| 48 | +# - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_in_range` for ranges |
| 49 | +# |
| 50 | +# For timestamp-based frame retrieval: |
| 51 | +# |
| 52 | +# - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_played_at` for timestamps |
| 53 | +# - :meth:`~torchcodec.decoders.VideoDecoder.get_frames_played_in_range` for time ranges |
| 54 | +# |
| 55 | +# %% |
| 56 | +# **When to use:** |
| 57 | +# |
| 58 | +# - Decoding multiple frames |
| 59 | + |
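As a rough illustration of the batch pattern described above, the sketch below picks evenly spaced frame indices and fetches them with a single `get_frames_at` call. The helper names `evenly_spaced_indices` and `decode_evenly_spaced` are hypothetical and not part of TorchCodec, and the sketch assumes `len(decoder)` reports the number of frames in the video.

```python
def evenly_spaced_indices(num_frames: int, num_samples: int) -> list[int]:
    """Return up to num_samples frame indices spread evenly over [0, num_frames)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def decode_evenly_spaced(path: str, num_samples: int):
    """Hypothetical helper: decode num_samples evenly spaced frames in one batch."""
    # Imported lazily so the index helper works without torchcodec installed.
    from torchcodec.decoders import VideoDecoder

    decoder = VideoDecoder(path)
    indices = evenly_spaced_indices(len(decoder), num_samples)
    # One batched call instead of len(indices) separate get_frame_at() calls.
    return decoder.get_frames_at(indices=indices)
```

Keeping the index computation separate from the decoding call makes it easy to swap in a different sampling strategy while still issuing only one batched request.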
| 60 | +# %% |
| 61 | +# .. note:: |
| 62 | +# |
| 63 | +# For complete examples with runnable code demonstrating batch decoding, |
| 64 | +# iteration, and frame retrieval, see :ref:`sphx_glr_generated_examples_decoding_basic_example.py` |
| 65 | + |
| 66 | +# %% |
| 67 | +# 2. Approximate Mode & Keyframe Mappings |
| 68 | +# ---------------------------------------- |
| 69 | +# |
| 70 | +# By default, TorchCodec uses ``seek_mode="exact"``, which performs a :term:`scan` when |
| 71 | +# you create the decoder to build an accurate internal index of frames. This |
| 72 | +# ensures frame-accurate seeking but takes longer for decoder initialization, |
| 73 | +# especially on long videos. |
| 74 | + |
| 75 | +# %% |
| 76 | +# **Approximate Mode** |
| 77 | +# ~~~~~~~~~~~~~~~~~~~~ |
| 78 | +# |
| 79 | +# Setting ``seek_mode="approximate"`` skips the initial :term:`scan` and relies on the |
| 80 | +# video file's metadata headers. This dramatically speeds up |
| 81 | +# :class:`~torchcodec.decoders.VideoDecoder` creation, particularly for long |
| 82 | +# videos, but may result in slightly less accurate seeking in some cases. |
| 83 | +# |
| 84 | +# |
| 85 | +# **Which mode should you use:** |
| 86 | +# |
| 87 | +# - If you care about exactness of frame seeking, use ``seek_mode="exact"``. |
| 88 | +# - If the video is long and you're only decoding a small number of frames, ``seek_mode="approximate"`` should be faster. |
| 89 | + |
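One way to check which mode suits a given video is to time decoder creation under both. The sketch below is a minimal measurement harness, not a rigorous benchmark: `timed` and `compare_decoder_creation` are hypothetical helper names, and a single run per mode is assumed to be good enough for a first impression.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_decoder_creation(path: str) -> dict[str, float]:
    """Hypothetical helper: time VideoDecoder creation in both seek modes."""
    # Imported lazily so timed() can be reused without torchcodec installed.
    from torchcodec.decoders import VideoDecoder

    timings = {}
    for mode in ("exact", "approximate"):
        # Only construction is timed; this is where the scan cost shows up.
        _, timings[mode] = timed(VideoDecoder, path, seek_mode=mode)
    return timings
```

The gap between the two timings is expected to grow with video length, since exact mode scans the file at creation time.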
| 90 | +# %% |
| 91 | +# **Custom Frame Mappings** |
| 92 | +# ~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 93 | +# |
| 94 | +# For advanced use cases, you can pre-compute a custom mapping between desired |
| 95 | +# frame indices and actual keyframe locations. This allows you to speed up :class:`~torchcodec.decoders.VideoDecoder` |
| 96 | +# instantiation while maintaining the frame seeking accuracy of ``seek_mode="exact"``. |
| 97 | +# |
| 98 | +# **When to use:** |
| 99 | +# |
| 100 | +# - Frame accuracy is critical, so you cannot use approximate mode |
| 101 | +# - You can preprocess videos once and then decode them many times |
| 102 | +# |
| 103 | +# **Performance impact:** speeds up decoder instantiation, similarly to ``seek_mode="approximate"``. |
| 104 | + |
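The preprocess-once, decode-many workflow can be sketched as follows. This is an assumption-laden outline rather than the library's prescribed method: the helper names are hypothetical, the ffprobe flags and the `seek_mode="custom_frame_mappings"` / `custom_frame_mappings` arguments should be checked against the custom frame mappings example, and the first video stream (`v:0`) is assumed to be the one being decoded.

```python
import subprocess


def ffprobe_frame_mappings_cmd(path: str) -> list[str]:
    """Build an ffprobe command that dumps per-frame pts, duration and keyframe flags as JSON."""
    return [
        "ffprobe", "-i", path,
        "-select_streams", "v:0",  # assumption: first video stream
        "-show_frames",
        "-show_entries", "frame=pts,duration,key_frame",
        "-of", "json",
    ]


def write_frame_mappings(path: str, out_path: str) -> None:
    """Hypothetical preprocessing step: run ffprobe once and save its JSON output."""
    completed = subprocess.run(
        ffprobe_frame_mappings_cmd(path),
        check=True, capture_output=True, text=True,
    )
    with open(out_path, "w") as f:
        f.write(completed.stdout)


def decoder_from_mappings(path: str, mappings_path: str):
    """Hypothetical helper: create a decoder from the precomputed mappings."""
    from torchcodec.decoders import VideoDecoder

    with open(mappings_path) as f:
        # Assumed API: the decoder reads the JSON mappings during construction.
        return VideoDecoder(
            path, seek_mode="custom_frame_mappings", custom_frame_mappings=f
        )
```

The ffprobe pass costs about as much as a scan, so this only pays off when the saved mappings are reused across many decoder instantiations.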
| 105 | +# %% |
| 106 | +# .. note:: |
| 107 | +# |
| 108 | +# For complete benchmarks showing actual speedup numbers, accuracy comparisons, |
| 109 | +# and implementation examples, see :ref:`sphx_glr_generated_examples_decoding_approximate_mode.py` |
| 110 | +# and :ref:`sphx_glr_generated_examples_decoding_custom_frame_mappings.py` |
| 111 | + |
| 112 | +# %% |
| 113 | +# 3. Multi-threading for Parallel Decoding |
| 114 | +# ----------------------------------------- |
| 115 | +# |
| 116 | +# When decoding multiple videos or decoding a large number of frames from a single video, there are a few parallelization strategies to speed up the decoding process: |
| 117 | +# |
| 118 | +# - **FFmpeg-based parallelism** - Using FFmpeg's internal threading capabilities for intra-frame parallelism, where parallelization happens within individual frames rather than across frames. To enable it, set the ``num_ffmpeg_threads`` parameter of :class:`~torchcodec.decoders.VideoDecoder`. |
| 119 | +# - **Multiprocessing** - Distributing work across multiple processes |
| 120 | +# - **Multithreading** - Using multiple threads within a single process |
| 121 | +# |
| 122 | +# Both multiprocessing and multithreading can be used to decode multiple videos in parallel, or to decode a single long video in parallel by splitting it into chunks. |
| 123 | + |
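The chunking idea for a single long video can be sketched with a thread pool. This is a minimal sketch under stated assumptions, not a tuned implementation: `split_into_chunks` and `decode_video_in_threads` are hypothetical helper names, each worker creates its own decoder instead of sharing one, and all decoded frames are assumed to fit in memory.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(num_frames: int, num_chunks: int) -> list[tuple[int, int]]:
    """Split [0, num_frames) into contiguous (start, stop) ranges of near-equal size."""
    if num_frames <= 0 or num_chunks <= 0:
        return []
    num_chunks = min(num_chunks, num_frames)
    base, extra = divmod(num_frames, num_chunks)
    chunks = []
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional frame each.
        stop = start + base + (1 if i < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def decode_video_in_threads(path: str, num_workers: int = 4):
    """Hypothetical helper: decode one video in parallel, one chunk per worker."""
    from torchcodec.decoders import VideoDecoder

    num_frames = len(VideoDecoder(path))

    def decode_chunk(bounds):
        start, stop = bounds
        # Assumption: a separate decoder per worker avoids sharing decoder state.
        decoder = VideoDecoder(path)
        return decoder.get_frames_in_range(start=start, stop=stop).data

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves chunk order, so results come back in frame order.
        return list(pool.map(decode_chunk, split_into_chunks(num_frames, num_workers)))
```

Because every worker builds its own decoder, initialization cost is paid once per chunk; combining this with a faster seek mode may reduce that overhead.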
| 124 | +# %% |
| 125 | +# .. note:: |
| 126 | +# |
| 127 | +# For complete examples comparing |
| 128 | +# sequential, ffmpeg-based parallelism, multi-process, and multi-threaded approaches, see |
| 129 | +# :ref:`sphx_glr_generated_examples_decoding_parallel_decoding.py` |
| 130 | + |
| 131 | +# %% |
| 132 | +# 4. CUDA Acceleration |
| 133 | +# -------------------- |
| 134 | +# |
| 135 | +# TorchCodec supports GPU-accelerated decoding using NVIDIA's hardware decoder |
| 136 | +# (NVDEC) on supported hardware. This keeps decoded tensors in GPU memory, |
| 137 | +# avoiding expensive CPU-GPU transfers for downstream GPU operations. |
| 138 | +# |
| 139 | +# %% |
| 140 | +# **Recommended: Use the Beta Interface** |
| 141 | +# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 142 | +# |
| 143 | +# We recommend the new "beta" CUDA interface, which is significantly faster than the previous one and supports the same features: |
| 144 | +# |
| 145 | +# .. code-block:: python |
| 146 | +# |
| 147 | +#     with set_cuda_backend("beta"): |
| 148 | +#         decoder = VideoDecoder("file.mp4", device="cuda") |
| 149 | +# |
| 150 | +# %% |
| 151 | +# **When to use:** |
| 152 | +# |
| 153 | +# - Decoding high-resolution videos |
| 154 | +# - Decoding a large batch of videos that would otherwise saturate the CPU |
| 155 | +# |
| 156 | +# **When NOT to use:** |
| 157 | +# |
| 158 | +# - You need results that are bit-exact with CPU decoding |
| 159 | +# - The videos are low-resolution and the PCI-e transfer latency is large |
| 160 | +# - The GPU is already busy and the CPU is idle |
| 161 | +# |
| 162 | +# **Performance impact:** CUDA decoding can significantly outperform CPU decoding, |
| 163 | +# especially for high-resolution videos and when decoding a lot of frames. |
| 164 | +# Actual speedup varies by hardware, resolution, and codec. |
| 165 | + |
| 166 | +# %% |
| 167 | +# **Checking for CPU Fallback** |
| 168 | +# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 169 | +# |
| 170 | +# In some cases, CUDA decoding may silently fall back to CPU decoding when the |
| 171 | +# video codec or format is not supported by NVDEC. You can detect this using |
| 172 | +# the :attr:`~torchcodec.decoders.VideoDecoder.cpu_fallback` attribute: |
| 173 | +# |
| 174 | +# .. code-block:: python |
| 175 | +# |
| 176 | +#     with set_cuda_backend("beta"): |
| 177 | +#         decoder = VideoDecoder("file.mp4", device="cuda") |
| 178 | +# |
| 179 | +#     # Print detailed fallback status |
| 180 | +#     print(decoder.cpu_fallback) |
| 181 | +# |
| 182 | +# .. note:: |
| 183 | +# |
| 184 | +# The timing of when you can detect CPU fallback differs between backends: |
| 185 | +# with the **FFmpeg backend**, you can only check fallback status after decoding at |
| 186 | +# least one frame, because FFmpeg determines codec support lazily during decoding; |
| 187 | +# with the **beta backend**, you can check fallback status immediately after |
| 188 | +# decoder creation, as the backend checks codec support upfront. |
| 189 | +# |
| 190 | +# For installation instructions, detailed examples, and visual comparisons |
| 191 | +# between CPU and CUDA decoding, see :ref:`sphx_glr_generated_examples_decoding_basic_cuda_example.py` |
| 192 | + |
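Putting the CUDA pieces together, the sketch below chooses a device, creates the decoder through the beta backend when a GPU is used, and prints the fallback status. The helper names `pick_device` and `open_decoder` are hypothetical, and the sketch assumes `torch.cuda.is_available()` is an adequate check that GPU decoding can be attempted.

```python
def pick_device(cuda_available: bool, want_gpu: bool = True) -> str:
    """Return "cuda" only when a GPU is both requested and available."""
    return "cuda" if (want_gpu and cuda_available) else "cpu"


def open_decoder(path: str, want_gpu: bool = True):
    """Hypothetical helper: create a decoder on the GPU when possible, else on the CPU."""
    # Imported lazily so pick_device() works without torch or torchcodec installed.
    import torch
    from torchcodec.decoders import VideoDecoder, set_cuda_backend

    device = pick_device(torch.cuda.is_available(), want_gpu)
    if device == "cpu":
        return VideoDecoder(path, device="cpu")

    with set_cuda_backend("beta"):
        decoder = VideoDecoder(path, device="cuda")
    # With the beta backend the status is available right after creation.
    print(decoder.cpu_fallback)
    return decoder
```

Checking the fallback status once at creation time makes it easier to notice when a video that was expected to decode on the GPU is actually being decoded on the CPU.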
| 193 | +# %% |
| 194 | +# Conclusion |
| 195 | +# ---------- |
| 196 | +# |
| 197 | +# TorchCodec offers multiple performance optimization strategies, each suited to |
| 198 | +# different scenarios. Use batch APIs for multi-frame decoding, approximate mode |
| 199 | +# for faster initialization, parallel processing for high throughput, and CUDA |
| 200 | +# acceleration to offload the CPU. |
| 201 | +# |
| 202 | +# The best results often come from combining techniques. Profile your specific |
| 203 | +# use case and apply optimizations incrementally, using the benchmarks in the |
| 204 | +# linked examples as a guide. |
| 205 | +# |
| 206 | +# For more information, see: |
| 207 | +# |
| 208 | +# - :ref:`sphx_glr_generated_examples_decoding_basic_example.py` - Basic decoding examples |
| 209 | +# - :ref:`sphx_glr_generated_examples_decoding_approximate_mode.py` - Approximate mode benchmarks |
| 210 | +# - :ref:`sphx_glr_generated_examples_decoding_custom_frame_mappings.py` - Custom frame mappings |
| 211 | +# - :ref:`sphx_glr_generated_examples_decoding_parallel_decoding.py` - Parallel decoding strategies |
| 212 | +# - :ref:`sphx_glr_generated_examples_decoding_basic_cuda_example.py` - CUDA acceleration guide |
| 213 | +# - :class:`torchcodec.decoders.VideoDecoder` - Full API reference |