This PR changes the transform benchmarks to:
- Use the public `VideoDecoder` API instead of the core API the benchmarks were originally written against; see Decoder-native transforms benchmark #982. The benchmarks were implemented before the public API existed.
- Explicitly set the number of FFmpeg threads.

The last point is the most important, as the previously reported benchmarks were, unintentionally, using 0 as the number of FFmpeg threads. That meant FFmpeg would decide, which usually means half of the available cores. A rough sketch of the two paths being benchmarked follows.
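The sketch below is not the benchmark code; it's a minimal illustration of the two paths. The `transforms=` parameter for decoder-native transforms is an assumption (the actual spelling of that API may differ), and the video path, target size, and frame indices are made up.

```python
from torchcodec.decoders import VideoDecoder
from torchvision.transforms import v2

VIDEO = "input.mp4"        # made-up path
TARGET = (270, 480)        # made-up target size
INDICES = list(range(0, 100, 10))

# Path 1: decode full frames, then resize them in TorchVision.
decoder = VideoDecoder(VIDEO, num_ffmpeg_threads=1)  # 0 would let FFmpeg decide
resize = v2.Resize(TARGET)
tv_frames = [resize(decoder[i]) for i in INDICES]

# Path 2: ask the decoder to resize while decoding (decoder-native transform).
# NOTE: `transforms=` is an assumption about the public API, not its confirmed name.
decoder = VideoDecoder(VIDEO, num_ffmpeg_threads=1, transforms=[v2.Resize(TARGET)])
native_frames = [decoder[i] for i in INDICES]
```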
Runtime Performance
I'm going to drop two large batches of results here because I think they're useful for posterity. First, I'm using this video:
With threads set to 0, where FFmpeg decides and ends up using half of my system's cores:
And then with 1 thread:
I'm going to focus on the difference between 0 and 1 threads for the configuration with the largest number of frames sampled and the largest size reduction:
Importantly, the number of threads changes the relative cost of using decoder transforms versus passing the fully decoded frame to TorchVision. I cannot yet fully explain these results, but possible factors are (a small sketch follows the list):
- `sws_scale()` in swscale. We know TorchVision's resize is SIMDized, and libswscale appears to be as well. (See: https://github.com/FFmpeg/FFmpeg/tree/master/libswscale/x86.)
- `format=rgb24` before the transforms, including resize. This ensures that the transforms are applied in the output colorspace, but it also forces an extra call to `sws_scale()`. Not forcing this buys back about 7-10% of the performance. However, in principle, this should be basically the same as just applying the TorchVision resize after a normal decoding.
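To make the second factor concrete, here's a back-of-the-envelope sketch in plain PyTorch (not swscale): a per-pixel colorspace conversion applied before the resize touches roughly 16x more pixels than one applied after it. The 3x3 matrix, frame size, and target size are all made up for illustration.

```python
import torch
from torchvision.transforms.v2 import functional as F

full = torch.rand(3, 1080, 1920)   # stand-in for a full-resolution frame
M = torch.rand(3, 3)               # stand-in for a YUV -> RGB matrix

def to_rgb(img):
    # Per-pixel 3x3 color transform; cost scales with the pixel count.
    return torch.einsum("ij,jhw->ihw", M, img)

def convert_then_resize():         # what forcing format=rgb24 first implies
    return F.resize(to_rgb(full), [270, 480], antialias=True)

def resize_then_convert():         # the conversion runs on ~16x fewer pixels
    return to_rgb(F.resize(full, [270, 480], antialias=True))
```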
Memory Performance
Here's the good news: decoder transforms are radically more memory efficient. I haven't been able to instrument the benchmarks yet to capture this, but I can observe it through top.
- If I hack the benchmark to only run `torchvision_resize()`, then the RSS grows to 4.3 GB in cycles. That is, it gets that high, drops by way less than 1 GB, then grows again. I think we're observing the Python garbage collector kicking in.
- If I hack the benchmark to only run `decoder_resize()`, then the RSS stays at 0.4 MB the entire time.
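For when we do instrument the benchmarks, here's one minimal, Linux-only way to sample the current RSS from inside the process instead of watching top; the calls named in the comments are hypothetical.

```python
def rss_mb() -> float:
    """Current resident set size in MB, read from /proc/self/status (Linux only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # VmRSS is reported in kB
    return float("nan")

# Hypothetical usage around each benchmarked call:
# before = rss_mb()
# torchvision_resize()   # or decoder_resize()
# print(f"RSS: {before:.1f} MB -> {rss_mb():.1f} MB")
```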