
Conversation

@scotts (Contributor) commented on Dec 10, 2025

This PR changes the transform benchmarks to:

  1. Use the public VideoDecoder API instead of the core API as originally implemented; see Decoder-native transforms benchmark #982. The benchmarks were implemented before the public API existed.
  2. Add more command-line parameters.
  3. Ensure that the number of FFmpeg threads is a parameter and that its default does the right thing.

The last point is the most important: the previously reported benchmarks were, unintentionally, using 0 as the number of FFmpeg threads. That meant FFmpeg would decide, and it usually uses half of the available cores.
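For context, here is a minimal sketch of how that parameter reaches the decoder; I'm assuming the public VideoDecoder's num_ffmpeg_threads argument, which the benchmark passes through:

from torchcodec.decoders import VideoDecoder

# num_ffmpeg_threads=0 delegates the choice to FFmpeg, which usually
# picks about half of the available cores. The benchmark now defaults
# to 1, so single-threaded decoding is measured unless asked otherwise.
decoder = VideoDecoder("mandelbrot.mp4", num_ffmpeg_threads=1)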

Runtime Performance

I'm going to drop two large batches of results here because I think they're useful for posterity. First, I'm using this video:

ffmpeg -y -f lavfi -i "mandelbrot=s=1920x1080" -t 120 -c:v libopenh264 -r 60 -g 600 -pix_fmt yuv420p mandelbrot.mp4
With threads set to 0, where FFmpeg will use half of my system's cores:
$ python benchmarks/decoders/benchmark_transforms.py --path mandelbrot.mp4 --num-exp 5 --num-threads 0
Benchmarking mandelbrot.mp4, duration: 120.0, codec: h264, averaging over 5 runs:
Sampling 0.5%, 36, of 7200 frames
torchvision_resize((540, 960))                med = 3006.56, mean = 3004.81 +- 48.55, min = 2950.30, max = 3072.42 - in ms
decoder_resize((540, 960))                    med = 2933.02, mean = 2890.73 +- 73.60, min = 2781.78, max = 2953.67 - in ms
torchvision_crop((540, 960))                  med = 2864.97, mean = 2846.18 +- 55.07, min = 2782.38, max = 2902.71 - in ms
decoder_crop((540, 960))                      med = 2817.70, mean = 2789.60 +- 50.54, min = 2725.13, max = 2835.21 - in ms

torchvision_resize((270, 480))                med = 2897.15, mean = 2896.86 +- 25.80, min = 2864.67, max = 2922.16 - in ms
decoder_resize((270, 480))                    med = 2907.26, mean = 2875.35 +- 85.59, min = 2783.23, max = 2951.16 - in ms
torchvision_crop((270, 480))                  med = 2906.37, mean = 2897.53 +- 57.39, min = 2807.47, max = 2966.51 - in ms
decoder_crop((270, 480))                      med = 2773.08, mean = 2787.57 +- 43.39, min = 2755.06, max = 2862.01 - in ms

torchvision_resize((135, 240))                med = 2890.27, mean = 2896.13 +- 29.21, min = 2869.95, max = 2946.07 - in ms
decoder_resize((135, 240))                    med = 2823.41, mean = 2815.61 +- 85.96, min = 2684.00, max = 2918.04 - in ms
torchvision_crop((135, 240))                  med = 2894.76, mean = 2899.43 +- 53.29, min = 2834.66, max = 2954.22 - in ms
decoder_crop((135, 240))                      med = 2802.06, mean = 2801.27 +- 61.00, min = 2712.93, max = 2880.97 - in ms

Sampling 1.0%, 72, of 7200 frames
torchvision_resize((540, 960))                med = 3753.92, mean = 3787.75 +- 91.46, min = 3685.64, max = 3918.23 - in ms
decoder_resize((540, 960))                    med = 3631.12, mean = 3608.48 +- 66.05, min = 3537.92, max = 3670.47 - in ms
torchvision_crop((540, 960))                  med = 3520.03, mean = 3507.02 +- 30.43, min = 3457.22, max = 3529.65 - in ms
decoder_crop((540, 960))                      med = 3536.93, mean = 3521.72 +- 24.95, min = 3485.61, max = 3541.21 - in ms

torchvision_resize((270, 480))                med = 3698.27, mean = 3656.23 +- 62.46, min = 3586.39, max = 3706.44 - in ms
decoder_resize((270, 480))                    med = 3617.95, mean = 3622.41 +- 151.72, min = 3455.39, max = 3807.51 - in ms
torchvision_crop((270, 480))                  med = 3676.44, mean = 3669.73 +- 40.30, min = 3624.43, max = 3714.13 - in ms
decoder_crop((270, 480))                      med = 3438.98, mean = 3470.90 +- 85.22, min = 3373.86, max = 3592.65 - in ms

torchvision_resize((135, 240))                med = 3683.92, mean = 3666.65 +- 70.56, min = 3592.49, max = 3758.75 - in ms
decoder_resize((135, 240))                    med = 3519.28, mean = 3539.93 +- 37.77, min = 3506.18, max = 3593.31 - in ms
torchvision_crop((135, 240))                  med = 3561.68, mean = 3662.18 +- 193.42, min = 3495.19, max = 3918.03 - in ms
decoder_crop((135, 240))                      med = 3558.00, mean = 3515.12 +- 105.21, min = 3360.67, max = 3619.11 - in ms

Sampling 5.0%, 360, of 7200 frames
torchvision_resize((540, 960))                med = 6310.45, mean = 6279.03 +- 88.34, min = 6122.36, max = 6335.79 - in ms
decoder_resize((540, 960))                    med = 4645.45, mean = 4658.60 +- 29.03, min = 4633.45, max = 4702.54 - in ms
torchvision_crop((540, 960))                  med = 4938.24, mean = 4931.62 +- 55.29, min = 4846.19, max = 4992.57 - in ms
decoder_crop((540, 960))                      med = 4345.87, mean = 4362.94 +- 62.46, min = 4289.98, max = 4445.79 - in ms

torchvision_resize((270, 480))                med = 5687.46, mean = 5694.70 +- 47.98, min = 5643.08, max = 5764.25 - in ms
decoder_resize((270, 480))                    med = 4448.39, mean = 4476.09 +- 62.34, min = 4411.99, max = 4571.51 - in ms
torchvision_crop((270, 480))                  med = 4934.87, mean = 4960.69 +- 74.59, min = 4908.33, max = 5090.58 - in ms
decoder_crop((270, 480))                      med = 4275.85, mean = 4260.12 +- 48.30, min = 4185.06, max = 4304.83 - in ms

torchvision_resize((135, 240))                med = 5404.12, mean = 5418.91 +- 63.23, min = 5354.71, max = 5520.74 - in ms
decoder_resize((135, 240))                    med = 4383.50, mean = 4390.66 +- 20.54, min = 4376.85, max = 4426.62 - in ms
torchvision_crop((135, 240))                  med = 4862.93, mean = 4866.41 +- 44.02, min = 4809.66, max = 4920.67 - in ms
decoder_crop((135, 240))                      med = 4239.89, mean = 4224.09 +- 41.37, min = 4173.58, max = 4268.28 - in ms

Sampling 10.0%, 720, of 7200 frames
torchvision_resize((540, 960))                med = 8099.23, mean = 8132.06 +- 107.66, min = 8036.62, max = 8286.54 - in ms
decoder_resize((540, 960))                    med = 5082.38, mean = 5053.43 +- 54.25, min = 4991.27, max = 5110.40 - in ms
torchvision_crop((540, 960))                  med = 5286.84, mean = 5298.40 +- 19.91, min = 5281.03, max = 5326.81 - in ms
decoder_crop((540, 960))                      med = 4561.46, mean = 4567.39 +- 27.28, min = 4541.30, max = 4605.51 - in ms

torchvision_resize((270, 480))                med = 6897.25, mean = 6878.87 +- 65.91, min = 6789.60, max = 6938.21 - in ms
decoder_resize((270, 480))                    med = 4813.42, mean = 4823.67 +- 36.58, min = 4792.88, max = 4882.58 - in ms
torchvision_crop((270, 480))                  med = 5203.32, mean = 5201.83 +- 50.65, min = 5131.86, max = 5254.45 - in ms
decoder_crop((270, 480))                      med = 4576.46, mean = 4557.36 +- 36.09, min = 4499.04, max = 4586.19 - in ms

torchvision_resize((135, 240))                med = 6284.11, mean = 6267.33 +- 71.49, min = 6144.03, max = 6321.89 - in ms
decoder_resize((135, 240))                    med = 4805.02, mean = 4779.94 +- 88.93, min = 4679.31, max = 4873.81 - in ms
torchvision_crop((135, 240))                  med = 5211.46, mean = 5227.46 +- 58.04, min = 5184.95, max = 5327.92 - in ms
decoder_crop((135, 240))                      med = 4523.77, mean = 4523.37 +- 39.47, min = 4467.71, max = 4572.99 - in ms
And then with 1 thread:
$ python benchmarks/decoders/benchmark_transforms.py --path mandelbrot.mp4 --num-exp 5 --num-threads 1
Benchmarking mandelbrot.mp4, duration: 120.0, codec: h264, averaging over 5 runs:
Sampling 0.5%, 36, of 7200 frames
torchvision_resize((540, 960))                med = 18131.61, mean = 18102.38 +- 149.62, min = 17888.27, max = 18268.98 - in ms
decoder_resize((540, 960))                    med = 18201.50, mean = 18261.95 +- 137.81, min = 18154.36, max = 18485.10 - in ms
torchvision_crop((540, 960))                  med = 17877.96, mean = 17861.91 +- 56.63, min = 17773.04, max = 17911.99 - in ms
decoder_crop((540, 960))                      med = 17771.08, mean = 17758.01 +- 41.33, min = 17685.27, max = 17785.52 - in ms

torchvision_resize((270, 480))                med = 18006.64, mean = 18032.64 +- 84.55, min = 17935.69, max = 18163.09 - in ms
decoder_resize((270, 480))                    med = 18095.68, mean = 18200.20 +- 252.82, min = 17981.77, max = 18563.14 - in ms
torchvision_crop((270, 480))                  med = 18023.64, mean = 18018.42 +- 63.02, min = 17917.22, max = 18088.02 - in ms
decoder_crop((270, 480))                      med = 18044.71, mean = 17989.09 +- 115.62, min = 17801.89, max = 18078.01 - in ms

torchvision_resize((135, 240))                med = 17934.54, mean = 17918.85 +- 46.36, min = 17846.72, max = 17960.52 - in ms
decoder_resize((135, 240))                    med = 17929.54, mean = 18003.14 +- 149.91, min = 17893.80, max = 18255.33 - in ms
torchvision_crop((135, 240))                  med = 17878.26, mean = 17916.97 +- 82.69, min = 17841.21, max = 18015.84 - in ms
decoder_crop((135, 240))                      med = 17987.70, mean = 18215.67 +- 601.03, min = 17706.06, max = 19231.63 - in ms

Sampling 1.0%, 72, of 7200 frames
torchvision_resize((540, 960))                med = 22877.97, mean = 22989.13 +- 303.27, min = 22754.35, max = 23510.40 - in ms
decoder_resize((540, 960))                    med = 23253.30, mean = 23346.07 +- 348.92, min = 23127.41, max = 23959.95 - in ms
torchvision_crop((540, 960))                  med = 22864.57, mean = 22883.20 +- 141.02, min = 22691.47, max = 23084.99 - in ms
decoder_crop((540, 960))                      med = 22561.47, mean = 22702.42 +- 281.03, min = 22425.31, max = 23024.37 - in ms

torchvision_resize((270, 480))                med = 22713.37, mean = 22755.43 +- 104.58, min = 22665.30, max = 22904.19 - in ms
decoder_resize((270, 480))                    med = 22884.69, mean = 22900.10 +- 104.08, min = 22764.80, max = 23028.52 - in ms
torchvision_crop((270, 480))                  med = 22547.32, mean = 22549.32 +- 52.01, min = 22473.02, max = 22618.18 - in ms
decoder_crop((270, 480))                      med = 22401.76, mean = 22447.36 +- 96.94, min = 22354.52, max = 22592.69 - in ms

torchvision_resize((135, 240))                med = 23047.19, mean = 22925.35 +- 205.85, min = 22693.56, max = 23089.60 - in ms
decoder_resize((135, 240))                    med = 22967.58, mean = 22949.48 +- 110.34, min = 22828.10, max = 23103.66 - in ms
torchvision_crop((135, 240))                  med = 22641.84, mean = 22623.04 +- 32.86, min = 22571.59, max = 22649.87 - in ms
decoder_crop((135, 240))                      med = 22435.13, mean = 22428.62 +- 72.18, min = 22333.83, max = 22527.30 - in ms

Sampling 5.0%, 360, of 7200 frames
torchvision_resize((540, 960))                med = 28903.91, mean = 28989.61 +- 176.48, min = 28875.88, max = 29293.35 - in ms
decoder_resize((540, 960))                    med = 30762.54, mean = 30778.12 +- 319.32, min = 30477.67, max = 31277.21 - in ms
torchvision_crop((540, 960))                  med = 27651.92, mean = 27626.28 +- 71.92, min = 27523.07, max = 27697.98 - in ms
decoder_crop((540, 960))                      med = 26738.39, mean = 26766.57 +- 183.99, min = 26521.77, max = 27033.37 - in ms

torchvision_resize((270, 480))                med = 28278.45, mean = 28287.42 +- 64.26, min = 28213.16, max = 28361.05 - in ms
decoder_resize((270, 480))                    med = 29079.33, mean = 29108.04 +- 160.01, min = 28950.96, max = 29334.79 - in ms
torchvision_crop((270, 480))                  med = 27563.43, mean = 27612.72 +- 126.03, min = 27500.05, max = 27807.44 - in ms
decoder_crop((270, 480))                      med = 26819.88, mean = 26860.30 +- 175.75, min = 26637.92, max = 27069.36 - in ms

torchvision_resize((135, 240))                med = 28307.79, mean = 28279.58 +- 355.32, min = 27944.01, max = 28835.09 - in ms
decoder_resize((135, 240))                    med = 29654.80, mean = 29560.30 +- 392.00, min = 28906.78, max = 29958.71 - in ms
torchvision_crop((135, 240))                  med = 27383.97, mean = 27440.17 +- 167.43, min = 27270.75, max = 27697.96 - in ms
decoder_crop((135, 240))                      med = 26722.81, mean = 26701.36 +- 89.16, min = 26553.66, max = 26792.45 - in ms

Sampling 10.0%, 720, of 7200 frames
torchvision_resize((540, 960))                med = 32284.64, mean = 32356.82 +- 303.45, min = 32122.33, max = 32883.67 - in ms
decoder_resize((540, 960))                    med = 36423.08, mean = 36481.91 +- 185.25, min = 36311.75, max = 36796.18 - in ms
torchvision_crop((540, 960))                  med = 29460.67, mean = 29480.52 +- 260.56, min = 29201.84, max = 29870.92 - in ms
decoder_crop((540, 960))                      med = 27939.71, mean = 27958.15 +- 148.42, min = 27826.73, max = 28181.93 - in ms

torchvision_resize((270, 480))                med = 30866.56, mean = 30949.88 +- 237.82, min = 30778.52, max = 31362.19 - in ms
decoder_resize((270, 480))                    med = 33179.70, mean = 33189.49 +- 275.80, min = 32913.54, max = 33616.09 - in ms
torchvision_crop((270, 480))                  med = 29637.93, mean = 29608.11 +- 129.47, min = 29463.67, max = 29765.43 - in ms
decoder_crop((270, 480))                      med = 27892.94, mean = 27887.00 +- 108.59, min = 27752.73, max = 28048.34 - in ms

torchvision_resize((135, 240))                med = 30484.33, mean = 30472.07 +- 107.43, min = 30335.45, max = 30612.89 - in ms
decoder_resize((135, 240))                    med = 31971.49, mean = 31948.56 +- 145.47, min = 31735.10, max = 32103.10 - in ms
torchvision_crop((135, 240))                  med = 29441.21, mean = 29542.92 +- 362.14, min = 29204.78, max = 30123.61 - in ms
decoder_crop((135, 240))                      med = 28146.61, mean = 28157.64 +- 196.44, min = 27905.72, max = 28416.11 - in ms

I'm going to focus on the difference between the largest number of frames sampled (10%) at the largest size reduction ((135, 240)), comparing 0 threads against 1 thread:

0 threads:
Sampling 10.0%, 720, of 7200 frames
torchvision_resize((135, 240))                med = 6284.11, mean = 6267.33 +- 71.49, min = 6144.03, max = 6321.89 - in ms
decoder_resize((135, 240))                    med = 4805.02, mean = 4779.94 +- 88.93, min = 4679.31, max = 4873.81 - in ms
torchvision_crop((135, 240))                  med = 5211.46, mean = 5227.46 +- 58.04, min = 5184.95, max = 5327.92 - in ms
decoder_crop((135, 240))                      med = 4523.77, mean = 4523.37 +- 39.47, min = 4467.71, max = 4572.99 - in ms

1 thread:
Sampling 10.0%, 720, of 7200 frames
torchvision_resize((135, 240))                med = 30484.33, mean = 30472.07 +- 107.43, min = 30335.45, max = 30612.89 - in ms
decoder_resize((135, 240))                    med = 31971.49, mean = 31948.56 +- 145.47, min = 31735.10, max = 32103.10 - in ms
torchvision_crop((135, 240))                  med = 29441.21, mean = 29542.92 +- 362.14, min = 29204.78, max = 30123.61 - in ms
decoder_crop((135, 240))                      med = 28146.61, mean = 28157.64 +- 196.44, min = 27905.72, max = 28416.11 - in ms

Importantly, the number of threads changes the relative costs of using decoder transforms versus passing the fully decoded frame to TorchVision. I cannot yet fully explain these results, but possible factors are:

  1. The startup cost of filtergraph. The fully-decoded-frames path does not use filtergraph; it uses swscale directly for the color conversion. Resize transform optimization: use swscale when appropriate #1018 could address this case. I am skeptical that the startup cost of filtergraph is the primary cause, though, as I would expect it to be amortized over the number of frames and the large resolution.
  2. It's possible the TorchVision resize implementation is faster than sws_scale() in swscale. We know TorchVision's resize is SIMDized, and libswscale appears to be as well. (See: https://github.com/FFmpeg/FFmpeg/tree/master/libswscale/x86.)
  3. FFmpeg threads may improve performance so much because they also parallelize the resize (and crop), whereas in the TorchVision use-case the transforms are applied serially after decoding; see the sketch after this list.
  4. In our filtergraph, we're explicitly prepending format=rgb24 before the transforms, including resize. This ensures that the transforms are applied in the output colorspace, but it also forces an extra call to sws_scale(). Not forcing this buys back about 7-10% of the performance. In principle, though, this should be basically the same as applying the TorchVision resize after normal decoding.
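To make the comparison concrete, here is a minimal sketch of the two pipelines being benchmarked, matching the 10% sampling runs above. The TorchVision calls are real; the decoder-native transforms argument is an assumption about the API shape, so it's left commented out:

from torchcodec.decoders import VideoDecoder
from torchvision.transforms import v2

indices = list(range(0, 7200, 10))  # 10% sampling, as in the runs above

# TorchVision path: decode full 1080p frames, then resize them serially.
decoder = VideoDecoder("mandelbrot.mp4", num_ffmpeg_threads=1)
frames = decoder.get_frames_at(indices).data  # uint8, N x C x H x W
resized = v2.functional.resize(frames, size=[135, 240])

# Decoder-native path: the resize runs inside FFmpeg's filtergraph (with
# format=rgb24 prepended, per point 4), so only small frames ever reach
# Python. The transforms= argument below is an assumption, not the
# confirmed API:
# decoder = VideoDecoder("mandelbrot.mp4", transforms=[v2.Resize((135, 240))])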

Memory Performance

Here's the good news: decoder transforms are radically more memory efficient. I haven't yet been able to instrument the benchmarks to capture this, but I can observe it through top.

If I hack the benchmark to only run torchvision_resize(), then the RSS grows to 4.3 GB in cycles. That is, it climbs that high, drops by well under 1 GB, then grows again. I think we're observing the Python garbage collector kicking in.

If I hack the benchmark to only run decoder_resize(), then the RSS stays at 0.4 MB the entire time.
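When we do instrument this, a minimal sketch (assuming psutil is available) of sampling RSS around a benchmark iteration could look like:

import psutil

proc = psutil.Process()

def rss_gb() -> float:
    # Resident set size of the current process, in GB.
    return proc.memory_info().rss / 1e9

before = rss_gb()
# ... run one benchmark iteration here ...
print(f"RSS: {before:.2f} GB -> {rss_gb():.2f} GB")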

@meta-cla bot added the CLA Signed label on Dec 10, 2025
@scotts marked this pull request as ready for review on December 10, 2025, 03:17
@NicolasHug (Contributor) left a comment:

Thanks @scotts !

Comment on lines +105 to +110
parser.add_argument(
    "--num-threads",
    type=int,
    default=1,
    help="number of threads to use; 0 means FFmpeg decides",
)
@NicolasHug (Contributor) commented:

We might want to also call torch.set_num_threads(args.num_threads) when num_threads != 0? In the current conditions of the benchmark, I think torch's resize isn't multi-threaded, so this should have no effect. But there are code paths where it is multithreaded over the batch dimension, depending on the input dtype and the interpolation mode (example: https://github.com/pytorch/pytorch/blame/afb173d9b9440d804b5f77d0c291e53c720d1fcf/aten/src/ATen/native/cpu/UpSampleKernel.cpp#L2024C18-L2024C18).

@scotts (Contributor, Author) replied:

If it were as clean as just providing this value to the Torch APIs, I think we should just do it now. But because of the weirdness with 0 (I don't think torch.set_num_threads() has an equivalent for letting the library decide automatically), we may want to control it with another flag or add some logic (n_cpus // 2). For those reasons, I'd rather punt on that until we want to get numbers for those scenarios.
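For reference, the punted-on logic would look roughly like this; a minimal sketch only, where the half-the-cores fallback mirrors FFmpeg's behavior and is an assumption, not a settled design:

import os

import torch

def set_torch_threads(num_threads: int) -> None:
    # torch.set_num_threads() has no "let the library decide" value, so
    # approximate FFmpeg's 0 convention with half of the available CPUs.
    if num_threads == 0:
        num_threads = max(1, (os.cpu_count() or 2) // 2)
    torch.set_num_threads(num_threads)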

@scotts merged commit 0574243 into meta-pytorch:main on Dec 10, 2025, with 67 of 68 checks passing.
@scotts deleted the transform_benchmarks branch on December 10, 2025, 14:15