
Conversation

@scotts (Contributor) commented on Dec 10, 2025

This PR changes the transform benchmarks to:

  1. Use the public VideoDecoder API instead of the core API as originally implemented; see Decoder-native transforms benchmark #982. The benchmarks were implemented before the public API existed.
  2. Add more command-line parameters.
  3. Ensure that the number of FFmpeg threads is a parameter and that its default does the right thing.

The last point is the most important: the previously reported benchmarks were, unintentionally, using 0 as the number of FFmpeg threads. That meant FFmpeg would decide, and it usually uses half of the available cores.
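For context, here is a minimal sketch of how that parameter reaches the decoder; I'm assuming the public VideoDecoder's num_ffmpeg_threads argument, which the benchmark passes through:

from torchcodec.decoders import VideoDecoder

# num_ffmpeg_threads=0 delegates the choice to FFmpeg, which usually
# picks about half of the available cores. The benchmark now defaults
# to 1, so single-threaded decoding is measured unless asked otherwise.
decoder = VideoDecoder("mandelbrot.mp4", num_ffmpeg_threads=1)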

Runtime Performance

I'm going to drop two large batches of results here because I think they're useful for posterity. First, I'm using this video:

ffmpeg -y -f lavfi -i "mandelbrot=s=1920x1080" -t 120 -c:v libopenh264 -r 60 -g 600 -pix_fmt yuv420p mandelbrot.mp4
With threads set to 0, where FFmpeg will use half of my system's cores:
$ python benchmarks/decoders/benchmark_transforms.py --path mandelbrot.mp4 --num-exp 5 --num-threads 0
Benchmarking mandelbrot.mp4, duration: 120.0, codec: h264, averaging over 5 runs:
Sampling 0.5%, 36, of 7200 frames
torchvision_resize((540, 960))                med = 3006.56, mean = 3004.81 +- 48.55, min = 2950.30, max = 3072.42 - in ms
decoder_resize((540, 960))                    med = 2933.02, mean = 2890.73 +- 73.60, min = 2781.78, max = 2953.67 - in ms
torchvision_crop((540, 960))                  med = 2864.97, mean = 2846.18 +- 55.07, min = 2782.38, max = 2902.71 - in ms
decoder_crop((540, 960))                      med = 2817.70, mean = 2789.60 +- 50.54, min = 2725.13, max = 2835.21 - in ms

torchvision_resize((270, 480))                med = 2897.15, mean = 2896.86 +- 25.80, min = 2864.67, max = 2922.16 - in ms
decoder_resize((270, 480))                    med = 2907.26, mean = 2875.35 +- 85.59, min = 2783.23, max = 2951.16 - in ms
torchvision_crop((270, 480))                  med = 2906.37, mean = 2897.53 +- 57.39, min = 2807.47, max = 2966.51 - in ms
decoder_crop((270, 480))                      med = 2773.08, mean = 2787.57 +- 43.39, min = 2755.06, max = 2862.01 - in ms

torchvision_resize((135, 240))                med = 2890.27, mean = 2896.13 +- 29.21, min = 2869.95, max = 2946.07 - in ms
decoder_resize((135, 240))                    med = 2823.41, mean = 2815.61 +- 85.96, min = 2684.00, max = 2918.04 - in ms
torchvision_crop((135, 240))                  med = 2894.76, mean = 2899.43 +- 53.29, min = 2834.66, max = 2954.22 - in ms
decoder_crop((135, 240))                      med = 2802.06, mean = 2801.27 +- 61.00, min = 2712.93, max = 2880.97 - in ms

Sampling 1.0%, 72, of 7200 frames
torchvision_resize((540, 960))                med = 3753.92, mean = 3787.75 +- 91.46, min = 3685.64, max = 3918.23 - in ms
decoder_resize((540, 960))                    med = 3631.12, mean = 3608.48 +- 66.05, min = 3537.92, max = 3670.47 - in ms
torchvision_crop((540, 960))                  med = 3520.03, mean = 3507.02 +- 30.43, min = 3457.22, max = 3529.65 - in ms
decoder_crop((540, 960))                      med = 3536.93, mean = 3521.72 +- 24.95, min = 3485.61, max = 3541.21 - in ms

torchvision_resize((270, 480))                med = 3698.27, mean = 3656.23 +- 62.46, min = 3586.39, max = 3706.44 - in ms
decoder_resize((270, 480))                    med = 3617.95, mean = 3622.41 +- 151.72, min = 3455.39, max = 3807.51 - in ms
torchvision_crop((270, 480))                  med = 3676.44, mean = 3669.73 +- 40.30, min = 3624.43, max = 3714.13 - in ms
decoder_crop((270, 480))                      med = 3438.98, mean = 3470.90 +- 85.22, min = 3373.86, max = 3592.65 - in ms

torchvision_resize((135, 240))                med = 3683.92, mean = 3666.65 +- 70.56, min = 3592.49, max = 3758.75 - in ms
decoder_resize((135, 240))                    med = 3519.28, mean = 3539.93 +- 37.77, min = 3506.18, max = 3593.31 - in ms
torchvision_crop((135, 240))                  med = 3561.68, mean = 3662.18 +- 193.42, min = 3495.19, max = 3918.03 - in ms
decoder_crop((135, 240))                      med = 3558.00, mean = 3515.12 +- 105.21, min = 3360.67, max = 3619.11 - in ms

Sampling 5.0%, 360, of 7200 frames
torchvision_resize((540, 960))                med = 6310.45, mean = 6279.03 +- 88.34, min = 6122.36, max = 6335.79 - in ms
decoder_resize((540, 960))                    med = 4645.45, mean = 4658.60 +- 29.03, min = 4633.45, max = 4702.54 - in ms
torchvision_crop((540, 960))                  med = 4938.24, mean = 4931.62 +- 55.29, min = 4846.19, max = 4992.57 - in ms
decoder_crop((540, 960))                      med = 4345.87, mean = 4362.94 +- 62.46, min = 4289.98, max = 4445.79 - in ms

torchvision_resize((270, 480))                med = 5687.46, mean = 5694.70 +- 47.98, min = 5643.08, max = 5764.25 - in ms
decoder_resize((270, 480))                    med = 4448.39, mean = 4476.09 +- 62.34, min = 4411.99, max = 4571.51 - in ms
torchvision_crop((270, 480))                  med = 4934.87, mean = 4960.69 +- 74.59, min = 4908.33, max = 5090.58 - in ms
decoder_crop((270, 480))                      med = 4275.85, mean = 4260.12 +- 48.30, min = 4185.06, max = 4304.83 - in ms

torchvision_resize((135, 240))                med = 5404.12, mean = 5418.91 +- 63.23, min = 5354.71, max = 5520.74 - in ms
decoder_resize((135, 240))                    med = 4383.50, mean = 4390.66 +- 20.54, min = 4376.85, max = 4426.62 - in ms
torchvision_crop((135, 240))                  med = 4862.93, mean = 4866.41 +- 44.02, min = 4809.66, max = 4920.67 - in ms
decoder_crop((135, 240))                      med = 4239.89, mean = 4224.09 +- 41.37, min = 4173.58, max = 4268.28 - in ms

Sampling 10.0%, 720, of 7200 frames
torchvision_resize((540, 960))                med = 8099.23, mean = 8132.06 +- 107.66, min = 8036.62, max = 8286.54 - in ms
decoder_resize((540, 960))                    med = 5082.38, mean = 5053.43 +- 54.25, min = 4991.27, max = 5110.40 - in ms
torchvision_crop((540, 960))                  med = 5286.84, mean = 5298.40 +- 19.91, min = 5281.03, max = 5326.81 - in ms
decoder_crop((540, 960))                      med = 4561.46, mean = 4567.39 +- 27.28, min = 4541.30, max = 4605.51 - in ms

torchvision_resize((270, 480))                med = 6897.25, mean = 6878.87 +- 65.91, min = 6789.60, max = 6938.21 - in ms
decoder_resize((270, 480))                    med = 4813.42, mean = 4823.67 +- 36.58, min = 4792.88, max = 4882.58 - in ms
torchvision_crop((270, 480))                  med = 5203.32, mean = 5201.83 +- 50.65, min = 5131.86, max = 5254.45 - in ms
decoder_crop((270, 480))                      med = 4576.46, mean = 4557.36 +- 36.09, min = 4499.04, max = 4586.19 - in ms

torchvision_resize((135, 240))                med = 6284.11, mean = 6267.33 +- 71.49, min = 6144.03, max = 6321.89 - in ms
decoder_resize((135, 240))                    med = 4805.02, mean = 4779.94 +- 88.93, min = 4679.31, max = 4873.81 - in ms
torchvision_crop((135, 240))                  med = 5211.46, mean = 5227.46 +- 58.04, min = 5184.95, max = 5327.92 - in ms
decoder_crop((135, 240))                      med = 4523.77, mean = 4523.37 +- 39.47, min = 4467.71, max = 4572.99 - in ms
And then with 1 thread:
$ python benchmarks/decoders/benchmark_transforms.py --path mandelbrot.mp4 --num-exp 5 --num-threads 1
Benchmarking mandelbrot.mp4, duration: 120.0, codec: h264, averaging over 5 runs:
Sampling 0.5%, 36, of 7200 frames
torchvision_resize((540, 960))                med = 18131.61, mean = 18102.38 +- 149.62, min = 17888.27, max = 18268.98 - in ms
decoder_resize((540, 960))                    med = 18201.50, mean = 18261.95 +- 137.81, min = 18154.36, max = 18485.10 - in ms
torchvision_crop((540, 960))                  med = 17877.96, mean = 17861.91 +- 56.63, min = 17773.04, max = 17911.99 - in ms
decoder_crop((540, 960))                      med = 17771.08, mean = 17758.01 +- 41.33, min = 17685.27, max = 17785.52 - in ms

torchvision_resize((270, 480))                med = 18006.64, mean = 18032.64 +- 84.55, min = 17935.69, max = 18163.09 - in ms
decoder_resize((270, 480))                    med = 18095.68, mean = 18200.20 +- 252.82, min = 17981.77, max = 18563.14 - in ms
torchvision_crop((270, 480))                  med = 18023.64, mean = 18018.42 +- 63.02, min = 17917.22, max = 18088.02 - in ms
decoder_crop((270, 480))                      med = 18044.71, mean = 17989.09 +- 115.62, min = 17801.89, max = 18078.01 - in ms

torchvision_resize((135, 240))                med = 17934.54, mean = 17918.85 +- 46.36, min = 17846.72, max = 17960.52 - in ms
decoder_resize((135, 240))                    med = 17929.54, mean = 18003.14 +- 149.91, min = 17893.80, max = 18255.33 - in ms
torchvision_crop((135, 240))                  med = 17878.26, mean = 17916.97 +- 82.69, min = 17841.21, max = 18015.84 - in ms
decoder_crop((135, 240))                      med = 17987.70, mean = 18215.67 +- 601.03, min = 17706.06, max = 19231.63 - in ms

Sampling 1.0%, 72, of 7200 frames
torchvision_resize((540, 960))                med = 22877.97, mean = 22989.13 +- 303.27, min = 22754.35, max = 23510.40 - in ms
decoder_resize((540, 960))                    med = 23253.30, mean = 23346.07 +- 348.92, min = 23127.41, max = 23959.95 - in ms
torchvision_crop((540, 960))                  med = 22864.57, mean = 22883.20 +- 141.02, min = 22691.47, max = 23084.99 - in ms
decoder_crop((540, 960))                      med = 22561.47, mean = 22702.42 +- 281.03, min = 22425.31, max = 23024.37 - in ms

torchvision_resize((270, 480))                med = 22713.37, mean = 22755.43 +- 104.58, min = 22665.30, max = 22904.19 - in ms
decoder_resize((270, 480))                    med = 22884.69, mean = 22900.10 +- 104.08, min = 22764.80, max = 23028.52 - in ms
torchvision_crop((270, 480))                  med = 22547.32, mean = 22549.32 +- 52.01, min = 22473.02, max = 22618.18 - in ms
decoder_crop((270, 480))                      med = 22401.76, mean = 22447.36 +- 96.94, min = 22354.52, max = 22592.69 - in ms

torchvision_resize((135, 240))                med = 23047.19, mean = 22925.35 +- 205.85, min = 22693.56, max = 23089.60 - in ms
decoder_resize((135, 240))                    med = 22967.58, mean = 22949.48 +- 110.34, min = 22828.10, max = 23103.66 - in ms
torchvision_crop((135, 240))                  med = 22641.84, mean = 22623.04 +- 32.86, min = 22571.59, max = 22649.87 - in ms
decoder_crop((135, 240))                      med = 22435.13, mean = 22428.62 +- 72.18, min = 22333.83, max = 22527.30 - in ms

Sampling 5.0%, 360, of 7200 frames
torchvision_resize((540, 960))                med = 28903.91, mean = 28989.61 +- 176.48, min = 28875.88, max = 29293.35 - in ms
decoder_resize((540, 960))                    med = 30762.54, mean = 30778.12 +- 319.32, min = 30477.67, max = 31277.21 - in ms
torchvision_crop((540, 960))                  med = 27651.92, mean = 27626.28 +- 71.92, min = 27523.07, max = 27697.98 - in ms
decoder_crop((540, 960))                      med = 26738.39, mean = 26766.57 +- 183.99, min = 26521.77, max = 27033.37 - in ms

torchvision_resize((270, 480))                med = 28278.45, mean = 28287.42 +- 64.26, min = 28213.16, max = 28361.05 - in ms
decoder_resize((270, 480))                    med = 29079.33, mean = 29108.04 +- 160.01, min = 28950.96, max = 29334.79 - in ms
torchvision_crop((270, 480))                  med = 27563.43, mean = 27612.72 +- 126.03, min = 27500.05, max = 27807.44 - in ms
decoder_crop((270, 480))                      med = 26819.88, mean = 26860.30 +- 175.75, min = 26637.92, max = 27069.36 - in ms

torchvision_resize((135, 240))                med = 28307.79, mean = 28279.58 +- 355.32, min = 27944.01, max = 28835.09 - in ms
decoder_resize((135, 240))                    med = 29654.80, mean = 29560.30 +- 392.00, min = 28906.78, max = 29958.71 - in ms
torchvision_crop((135, 240))                  med = 27383.97, mean = 27440.17 +- 167.43, min = 27270.75, max = 27697.96 - in ms
decoder_crop((135, 240))                      med = 26722.81, mean = 26701.36 +- 89.16, min = 26553.66, max = 26792.45 - in ms

Sampling 10.0%, 720, of 7200 frames
torchvision_resize((540, 960))                med = 32284.64, mean = 32356.82 +- 303.45, min = 32122.33, max = 32883.67 - in ms
decoder_resize((540, 960))                    med = 36423.08, mean = 36481.91 +- 185.25, min = 36311.75, max = 36796.18 - in ms
torchvision_crop((540, 960))                  med = 29460.67, mean = 29480.52 +- 260.56, min = 29201.84, max = 29870.92 - in ms
decoder_crop((540, 960))                      med = 27939.71, mean = 27958.15 +- 148.42, min = 27826.73, max = 28181.93 - in ms

torchvision_resize((270, 480))                med = 30866.56, mean = 30949.88 +- 237.82, min = 30778.52, max = 31362.19 - in ms
decoder_resize((270, 480))                    med = 33179.70, mean = 33189.49 +- 275.80, min = 32913.54, max = 33616.09 - in ms
torchvision_crop((270, 480))                  med = 29637.93, mean = 29608.11 +- 129.47, min = 29463.67, max = 29765.43 - in ms
decoder_crop((270, 480))                      med = 27892.94, mean = 27887.00 +- 108.59, min = 27752.73, max = 28048.34 - in ms

torchvision_resize((135, 240))                med = 30484.33, mean = 30472.07 +- 107.43, min = 30335.45, max = 30612.89 - in ms
decoder_resize((135, 240))                    med = 31971.49, mean = 31948.56 +- 145.47, min = 31735.10, max = 32103.10 - in ms
torchvision_crop((135, 240))                  med = 29441.21, mean = 29542.92 +- 362.14, min = 29204.78, max = 30123.61 - in ms
decoder_crop((135, 240))                      med = 28146.61, mean = 28157.64 +- 196.44, min = 27905.72, max = 28416.11 - in ms

I'm going to focus on the difference between the largest number of frames sampled (10%) at the largest size reduction ((135, 240)), comparing 0 threads against 1 thread:

0 threads:
Sampling 10.0%, 720, of 7200 frames
torchvision_resize((135, 240))                med = 6284.11, mean = 6267.33 +- 71.49, min = 6144.03, max = 6321.89 - in ms
decoder_resize((135, 240))                    med = 4805.02, mean = 4779.94 +- 88.93, min = 4679.31, max = 4873.81 - in ms
torchvision_crop((135, 240))                  med = 5211.46, mean = 5227.46 +- 58.04, min = 5184.95, max = 5327.92 - in ms
decoder_crop((135, 240))                      med = 4523.77, mean = 4523.37 +- 39.47, min = 4467.71, max = 4572.99 - in ms

1 thread:
Sampling 10.0%, 720, of 7200 frames
torchvision_resize((135, 240))                med = 30484.33, mean = 30472.07 +- 107.43, min = 30335.45, max = 30612.89 - in ms
decoder_resize((135, 240))                    med = 31971.49, mean = 31948.56 +- 145.47, min = 31735.10, max = 32103.10 - in ms
torchvision_crop((135, 240))                  med = 29441.21, mean = 29542.92 +- 362.14, min = 29204.78, max = 30123.61 - in ms
decoder_crop((135, 240))                      med = 28146.61, mean = 28157.64 +- 196.44, min = 27905.72, max = 28416.11 - in ms

Importantly, the number of threads changes the relative costs of using decoder transforms versus passing the fully decoded frame to TorchVision. I cannot yet fully explain these results, but possible factors are:

  1. The startup cost of filtergraph. The fully-decoded-frames path does not use filtergraph; it uses swscale directly for the color conversion. Resize transform optimization: use swscale when appropriate #1018 could address this case. I am skeptical that the startup cost of filtergraph is the primary cause, though, as I would expect it to be amortized over the number of frames and the large resolution.
  2. It's possible the TorchVision resize implementation is faster than sws_scale() in swscale. We know TorchVision's resize is SIMDized, and libswscale appears to be as well. (See: https://github.com/FFmpeg/FFmpeg/tree/master/libswscale/x86.)
  3. FFmpeg threads may improve performance so much because they also parallelize the resize (and crop), whereas in the TorchVision use-case the transforms are applied serially after decoding; see the sketch after this list.
  4. In our filtergraph, we're explicitly prepending format=rgb24 before the transforms, including resize. This ensures that the transforms are applied in the output colorspace, but it also forces an extra call to sws_scale(). Not forcing this buys back about 7-10% of the performance. In principle, though, this should be basically the same as applying the TorchVision resize after normal decoding.
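To make the comparison concrete, here is a minimal sketch of the two pipelines being benchmarked, matching the 10% sampling runs above. The TorchVision calls are real; the decoder-native transforms argument is an assumption about the API shape, so it's left commented out:

from torchcodec.decoders import VideoDecoder
from torchvision.transforms import v2

indices = list(range(0, 7200, 10))  # 10% sampling, as in the runs above

# TorchVision path: decode full 1080p frames, then resize them serially.
decoder = VideoDecoder("mandelbrot.mp4", num_ffmpeg_threads=1)
frames = decoder.get_frames_at(indices).data  # uint8, N x C x H x W
resized = v2.functional.resize(frames, size=[135, 240])

# Decoder-native path: the resize runs inside FFmpeg's filtergraph (with
# format=rgb24 prepended, per point 4), so only small frames ever reach
# Python. The transforms= argument below is an assumption, not the
# confirmed API:
# decoder = VideoDecoder("mandelbrot.mp4", transforms=[v2.Resize((135, 240))])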

Memory Performance

Here's the good news: decoder transforms are radically more memory efficient. I haven't yet been able to instrument the benchmarks to capture this, but I can observe it through top.

If I hack the benchmark to only run torchvision_resize(), then the RSS grows to 4.3 GB in cycles. That is, it climbs that high, drops by well under 1 GB, then grows again. I think we're observing the Python garbage collector kicking in.

If I hack the benchmark to only run decoder_resize(), then the RSS stays at 0.4 MB the entire time.
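When we do instrument this, a minimal sketch (assuming psutil is available) of sampling RSS around a benchmark iteration could look like:

import psutil

proc = psutil.Process()

def rss_gb() -> float:
    # Resident set size of the current process, in GB.
    return proc.memory_info().rss / 1e9

before = rss_gb()
# ... run one benchmark iteration here ...
print(f"RSS: {before:.2f} GB -> {rss_gb():.2f} GB")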

@meta-cla bot added the CLA Signed label on Dec 10, 2025
@scotts marked this pull request as ready for review on December 10, 2025, 03:17
@NicolasHug (Contributor) left a comment:

Thanks @scotts !

Comment on lines +105 to +110
parser.add_argument(
    "--num-threads",
    type=int,
    default=1,
    help="number of threads to use; 0 means FFmpeg decides",
)
@NicolasHug (Contributor) commented:

We might want to also call torch.set_num_threads(args.num_threads) when num_threads != 0? In the current conditions of the benchmark, I think torch's resize isn't multi-threaded, so this should have no effect. But there are code paths where it is multithreaded over the batch dimension, depending on the input dtype and the interpolation mode (example: https://github.com/pytorch/pytorch/blame/afb173d9b9440d804b5f77d0c291e53c720d1fcf/aten/src/ATen/native/cpu/UpSampleKernel.cpp#L2024C18-L2024C18).

@scotts (Contributor, Author) replied:

If it were as clean as just providing this value to the Torch APIs, I think we should just do it now. But because of the weirdness with 0 (I don't think torch.set_num_threads() has an equivalent for letting the library decide automatically), we may want to control it with another flag or add some logic (n_cpus // 2). For those reasons, I'd rather punt on that until we want to get numbers for those scenarios.
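For reference, the punted-on logic would look roughly like this; a minimal sketch only, where the half-the-cores fallback mirrors FFmpeg's behavior and is an assumption, not a settled design:

import os

import torch

def set_torch_threads(num_threads: int) -> None:
    # torch.set_num_threads() has no "let the library decide" value, so
    # approximate FFmpeg's 0 convention with half of the available CPUs.
    if num_threads == 0:
        num_threads = max(1, (os.cpu_count() or 2) // 2)
    torch.set_num_threads(num_threads)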

@scotts merged commit 0574243 into meta-pytorch:main on Dec 10, 2025, with 67 of 68 checks passing.
@scotts deleted the transform_benchmarks branch on December 10, 2025, 14:15