
Conversation

@scotts (Contributor) commented Dec 11, 2025

First draft of the transform tutorial. Things to consider:

  1. The order in which things are presented is, to me, a natural teaching order. But the actual feature we're demonstrating only appears a quarter of the way down the page! Does that seem okay when read, or should we try to find a way to pull the transforms usage in VideoDecoder up higher?
  2. I just copied the guarantees that are part of the DecoderTransform docstring because I feel that information is critical, and I didn't see a point in trying to rephrase it.
  3. The second "Note," after those guarantees, could be confusing. I think we have to say something on this, but we've made it difficult to talk about because we accept TorchVision transform objects. In writing, it's hard to distinguish between accepting TorchVision transform objects and applying them without getting super wordy. Let me know if you find it potentially confusing; see the sketch just below this list.
  4. I am underwhelmed by the lack of a demo for memory efficiency, but I don't have a way around it. And I think it needs to be said.
  5. The runtime guidance is subtle. Too subtle?

Once we align on 4 and 5, we should also update the performance tutorial. I think that should be a separate PR.
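To make point 3 concrete, here is a minimal sketch of the distinction: the decoder accepts TorchVision transform objects as specifications, but does not apply them itself. The `transforms` parameter name and the file path are assumptions drawn from the tutorial text quoted later in this thread, not a confirmed API reference.

# Hedged sketch, not the authoritative API: the `transforms` parameter name
# and the input path are assumed for illustration.
from torchcodec.decoders import VideoDecoder
from torchvision.transforms import v2

decoder = VideoDecoder(
    "video.mp4",  # hypothetical input file
    # The v2.Resize object is accepted as a *specification*: the decoder
    # applies an equivalent FFmpeg filter during decoding rather than
    # calling the TorchVision transform on already-decoded frames.
    transforms=[v2.Resize(size=(480, 640))],
)
frame = decoder[0]  # the frame is already resized when it reaches Python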

meta-cla bot added the "CLA Signed" label (Dec 11, 2025)
scotts marked this pull request as ready for review (December 11, 2025 04:50)
Comment on lines 203 to 204
v2.Resize(size=(480, 640)),
v2.CenterCrop(size=(315, 220))
Contributor:

It usually makes more sense to first crop and then resize, because resize will then work on a smaller surface.
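A sketch of the suggested ordering, using the same sizes as the diff; cropping first means the resize operates on the smaller cropped surface instead of the full frame:

from torchvision.transforms import v2

# Crop first, then resize: the resize now works on far fewer pixels.
transforms = [
    v2.CenterCrop(size=(315, 220)),
    v2.Resize(size=(480, 640)),
]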

Contributor Author:

Indeed it does, and curiously, it actually makes decoder transforms faster than the TorchVision version now (at least on my dev machine).

Results with the old way:

0:
decoder transforms:    times_med = 1474.17ms +- 79.85
torchvision transform: times_med = 4683.55ms +- 28.71

1:
decoder transforms:    times_med = 18486.50ms +- 165.66
torchvision transform: times_med = 16066.02ms +- 164.19

Results with the new way:

0:
decoder transforms:    times_med = 1352.46ms +- 34.86
torchvision transform: times_med = 4077.44ms +- 45.63

1:
decoder transforms:    times_med = 14771.99ms +- 148.83
torchvision transform: times_med = 16112.88ms +- 62.15
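For context, the numbers above follow the tutorial's `bench` helper output format. A minimal stand-in harness in that shape could look like this (an assumed sketch, not the tutorial's actual helper):

import statistics
import timeit

def bench(fn, num_runs=5):
    # Time fn several times and report the median +- stdev, in milliseconds.
    times = [timeit.timeit(fn, number=1) * 1000 for _ in range(num_runs)]
    return f"times_med = {statistics.median(times):.2f}ms +- {statistics.stdev(times):.2f}"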

# particularly when applying transforms that reduce the size of a frame, such
# as resize and crop. Because the transforms are applied during decoding, the
# full frame is never returned to the Python layer. As a result, there is
# significantly less pressure on the Python garbage collector.
Contributor:

I think there's another core reason why that's more memory efficient: the decompressed RGB frame is never materialized in its original resolution.

Without decoder-native transform we have:

YUV compressed frame in original res -> RGB decompressed frame in original res -> RGB decompressed frame in final (smaller) res

With the decoder-native transform we have:

YUV compressed frame in original res -> RGB decompressed frame in final (smaller) res

i.e. we can skip the "RGB decompressed frame in original res" materialization, which is the most memory-expensive bit.

The reduced pressure on the garbage collector is a consequence of that.
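A back-of-envelope calculation, with assumed example resolutions, of why that full-resolution RGB buffer is the expensive bit:

# Assumed example resolutions for illustration; uint8 RGB, 3 bytes per pixel.
full_res_bytes = 1920 * 1080 * 3   # ~6.2 MB: the full-res RGB intermediate
final_res_bytes = 480 * 640 * 3    # ~0.9 MB: the final resized frame
print(f"full res: {full_res_bytes / 1e6:.1f} MB, final: {final_res_bytes / 1e6:.1f} MB")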

Contributor Author (@scotts), Dec 12, 2025:

That's not entirely accurate - we definitely never get the "RGB decompressed frame in original res" in the Python layer, but it does exist inside FFmpeg. This is because we ensure that the FFmpeg filters get applied in the output color space. So without decoder transforms we have (parentheses indicate where each step happens: TorchCodec (TC) or TorchVision (TV)):

YUV compressed, original res (TC) -> 
RGB decompressed, original res (TC) -> 
RGB decompressed, smaller res (TV)

With decoder transforms it's:

YUV compressed, original res (TC) -> 
RGB decompressed, original res (TC) -> 
RGB decompressed, smaller res (TC)

So we really do go through the same steps in decoder transforms. That middle step - getting the RGB image in the original resolution - is because of this line:

filters_ = "format=rgb24," + filters.str();

Eliminating the explicit "format=rgb24" does improve performance a lot, but at the cost of similarity with using TorchVision transforms on full frames.

Since the filtergraph inputs and outputs are known statically, I suspect they're able to optimize things and reuse memory. That is, it's possible for them to allocate exactly the memory they need for each step and reuse it every time. But I don't know whether that's the case. I'll try to say something about all this.
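To illustrate the static construction, here is a hypothetical Python rendering of the C++ line above; the crop offsets and exact filter arguments TorchCodec generates are assumptions:

# FFmpeg's crop and scale filters take w:h(:x:y) arguments; the offsets
# here are made up for illustration.
user_filters = "crop=220:315:210:82,scale=640:480"
filtergraph = "format=rgb24," + user_filters
print(filtergraph)  # format=rgb24,crop=220:315:210:82,scale=640:480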

@Dan-Flores (Contributor):

Let's add a grid card item so this tutorial appears on the "Home" tab / index page:

.. grid-item-card:: :octicon:`file-code;1em` Performance Tips
   :img-top: _static/img/card-background.svg
   :link: generated_examples/decoding/performance_tips.html
   :link-type: url

   Tips for optimizing video decoding performance

the :class:`~torchcodec.decoders.VideoDecoder` class. This parameter allows us
to specify a list of :class:`torchcodec.transforms.DecoderTransform` or
:class:`torchvision.transforms.v2.Transform` objects. These objects serve as
transform specificiations that the :class:`~torchcodec.decoders.VideoDecoder`
Contributor:

nit: specifications

Contributor Author:

I should probably start asking Claude to do spell check on the comments. 🤔

"""

# %%
# First, a bit of boilerplate and definitions that we will use later:
Contributor:

Regarding point 1 in the PR description about the demonstration starting a quarter of the way down the page - we have a pattern of adding a link to skip past the boilerplate section, which might help this gap feel smaller:

# %%
# First, a bit of boilerplate: we'll download a video from the web, and define a
# plotting utility. You can ignore that part and jump right below to
# :ref:`sampling_tuto_start`.

print(f"torchvision transform: {bench(sample_torchvision_transforms, num_threads=1)}")

# %%
# In brief, our performance guidance is:
Contributor (@mollyxu):

Would it be worth mentioning decoder-native transforms in the performance tips docs?

Contributor Author:

@mollyxu, yes, absolutely. I'd like to do that in a follow-up PR.
