Transforms tutorial #1123
Conversation
examples/decoding/transforms.py
Outdated
v2.Resize(size=(480, 640)),
v2.CenterCrop(size=(315, 220))
It usually makes more sense to first crop and then resize, because resize will then work on a smaller surface.
Indeed it does, and curiously, it actually makes decoder transforms faster than the TorchVision version now (at least on my dev machine).
Results with the old way:
0:
decoder transforms: times_med = 1474.17ms +- 79.85
torchvision transform: times_med = 4683.55ms +- 28.71
1:
decoder transforms: times_med = 18486.50ms +- 165.66
torchvision transform: times_med = 16066.02ms +- 164.19
Results with the new way:
0:
decoder transforms: times_med = 1352.46ms +- 34.86
torchvision transform: times_med = 4077.44ms +- 45.63
1:
decoder transforms: times_med = 14771.99ms +- 148.83
torchvision transform: times_med = 16112.88ms +- 62.15
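For reference, here is a minimal sketch of the two orderings being compared, using the `transforms` parameter the tutorial describes; the video path is a placeholder, and the sizes are taken from the diff above. Note that the two orderings do not produce identical pixels; the point is the relative decoding cost.

```python
from torchcodec.decoders import VideoDecoder
from torchvision.transforms import v2

# Old ordering: resize first, then crop. Resize operates on the full frame.
decoder_old = VideoDecoder(
    "video.mp4",  # placeholder path
    transforms=[v2.Resize(size=(480, 640)), v2.CenterCrop(size=(315, 220))],
)

# New ordering: crop first, then resize. Resize operates on the smaller crop.
decoder_new = VideoDecoder(
    "video.mp4",  # placeholder path
    transforms=[v2.CenterCrop(size=(315, 220)), v2.Resize(size=(480, 640))],
)
```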
examples/decoding/transforms.py
Outdated
# particularly when applying transforms that reduce the size of a frame, such
# as resize and crop. Because the transforms are applied during decoding, the
# full frame is never returned to the Python layer. As a result, there is
# significantly less pressure on the Python garbage collector.
I think there's another core reason why that's more memory efficient: the decompressed RGB frame is never materialized in its original resolution.
Without decoder-native transform we have:
YUV compressed frame in original res -> RGB decompressed frame in original res -> RGB decompressed frame in final (smaller) res
With the decoder-native transform we have:
YUV compressed frame in original res -> RGB decompressed frame in final (smaller) res
i.e. we can skip the "RGB decompressed frame in original res" materialization, which is the most memory-expensive bit.
The reduced pressure on the garbage collector is a consequence of that.
That's not entirely accurate: we definitely never get the "RGB decompressed frame in original res" in the Python layer, but it does exist inside FFmpeg. This is because we ensure that the FFmpeg filters get applied in the output color space. So without decoder transforms we have (parentheses indicate where each step happens: TC = TorchCodec, TV = TorchVision):
YUV compressed, original res (TC) ->
RGB decompressed, original res (TC) ->
RGB decompressed, smaller res (TV)
With decoder transforms it's:
YUV compressed, original res (TC) ->
RGB decompressed, original res (TC) ->
RGB decompressed, smaller res (TC)
So we really do go through the same steps in decoder transforms. That middle step, getting the RGB image in the original resolution, happens because of this line:
filters_ = "format=rgb24," + filters.str();
Eliminating the explicit "format=rgb24" does improve performance a lot, but at the cost of similarity with using TorchVision transforms on full frames.
Since the filtergraph inputs and outputs are known statically, I suspect they're able to optimize things and reuse memory. That is, it's possible for them to allocate exactly the memory they need for each step and reuse it every time. But I don't know whether that's actually the case. I'll try to say something about all this.
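To make the `format=rgb24` point concrete, here is a hedged sketch of how the final filtergraph string gets assembled, following the C++ line quoted above; the crop/scale filter arguments are made-up values, not the tutorial's.

```python
# Hypothetical user transforms lowered to FFmpeg filter strings
# (the crop/scale arguments are made-up values).
user_filters = "crop=220:315,scale=640:480"

# The quoted C++ line prepends an explicit RGB conversion, which is why
# FFmpeg materializes an RGB frame at the original resolution before the
# user's filters run.
filters = "format=rgb24," + user_filters
print(filters)  # format=rgb24,crop=220:315,scale=640:480
```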
Let's add a grid card item so this tutorial appears on the "Home" tab:
torchcodec/docs/source/index.rst, lines 87 to 93 in 6300361
examples/decoding/transforms.py
Outdated
the :class:`~torchcodec.decoders.VideoDecoder` class. This parameter allows us
to specify a list of :class:`torchcodec.transforms.DecoderTransform` or
:class:`torchvision.transforms.v2.Transform` objects. These objects serve as
transform specificiations that the :class:`~torchcodec.decoders.VideoDecoder`
nit: specifications
I should probably start asking Claude to do spell check on the comments. 🤔
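As a quick illustration of the parameter the quoted text describes, here is a minimal usage sketch; the file path and the choice of transform are assumptions.

```python
from torchcodec.decoders import VideoDecoder
from torchvision.transforms import v2

# transforms accepts a list of DecoderTransform or torchvision
# v2.Transform objects, which the decoder treats as specifications
# for transforms applied during decoding.
decoder = VideoDecoder("video.mp4", transforms=[v2.Resize(size=(135, 240))])
frame = decoder[0]  # the frame is already resized when it reaches Python
```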
examples/decoding/transforms.py
Outdated
| """ | ||
|
|
||
| # %% | ||
| # First, a bit of boilerplate and definitions that we will use later: |
Regarding point 1 in the PR description, about the demonstration starting a quarter of the way down the page: we have a pattern of adding a link to skip past the boilerplate section, which might help this gap feel smaller.
torchcodec/examples/decoding/sampling.py
Lines 17 to 20 in 6300361
# %%
# First, a bit of boilerplate: we'll download a video from the web, and define a
# plotting utility. You can ignore that part and jump right below to
# :ref:`sampling_tuto_start`.
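Applied to this tutorial, the pattern could look like the sketch below; the label name `transforms_tuto_start` and the section title are hypothetical.

```python
# %%
# First, a bit of boilerplate and definitions that we will use later. You can
# ignore that part and jump right below to :ref:`transforms_tuto_start`.

# (boilerplate: download a video, define helpers, etc.)

# %%
# .. _transforms_tuto_start:
#
# Using transforms during decoding
# --------------------------------
```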
print(f"torchvision transform: {bench(sample_torchvision_transforms, num_threads=1)}")

# %%
# In brief, our performance guidance is:
Would it be worth mentioning decoder-native transforms in the performance tips docs?
@mollyxu, yes, absolutely. I'd like to do that in a follow-up PR.
First draft of the transform tutorial. Things to consider:

1. The demonstration doesn't start until about a quarter of the way down the page.
2. Move the `transforms` usage in `VideoDecoder` up higher?
3. ... copied from the `DecoderTransform` docstring, because I feel that information is critical, and I didn't see a point in trying to rephrase it.

Once we align on 4 and 5, we should also update the performance tutorial. I think that should be a separate PR.