⚡️ Speed up function decode by 58%
#91
Open
📄 58% (0.58x) speedup for `decode` in `invokeai/backend/image_util/dw_openpose/onnxpose.py`

⏱️ Runtime: 2.10 milliseconds → 1.33 milliseconds (best of 208 runs)

📝 Explanation and details
The optimized code achieves a 58% speedup through several key NumPy performance optimizations in the `get_simcc_maximum` function.

**Primary optimizations:**

1. **Eliminated redundant `np.amax` calls.** The original code called `np.amax` twice to find maximum values after already computing indices with `np.argmax`. The optimized version uses advanced indexing (`array[indices]`) to extract the maximum values directly, eliminating two expensive global reductions.
2. **Replaced masking with `np.minimum`.** Instead of creating a boolean mask and conditionally copying values (`max_val_x[mask] = max_val_y[mask]`), the code now uses `np.minimum(max_val_x, max_val_y)`, a single vectorized operation.
3. **Reduced array allocations.** The original `np.stack(...).astype(np.float32)` creates temporary arrays and performs a type conversion. The optimized version pre-allocates `locs` with the correct dtype and fills it directly, avoiding intermediate arrays.
4. **Minor division optimization.** Changed in-place division (`/=`) to regular division (`/`) in the `decode` function to avoid potential dtype-upcasting overhead. Both changes are sketched below.
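The following is a minimal sketch of what the optimized `get_simcc_maximum` looks like, reconstructed from the description above; the actual function in `invokeai/backend/image_util/dw_openpose/onnxpose.py` may differ in naming and detail:

```python
import numpy as np

def get_simcc_maximum(simcc_x: np.ndarray, simcc_y: np.ndarray):
    """Locate per-keypoint maxima in SimCC x/y representations.

    Sketch reconstructed from the PR description, not the verbatim source.
    simcc_x: (N, K, Wx), simcc_y: (N, K, Wy); returns locs of shape
    (N, K, 2) and vals of shape (N, K).
    """
    N, K, _ = simcc_x.shape
    flat_x = simcc_x.reshape(N * K, -1)
    flat_y = simcc_y.reshape(N * K, -1)

    # One argmax per axis; reuse the indices via advanced indexing to read
    # the max values, instead of issuing two extra np.amax reductions.
    x_locs = np.argmax(flat_x, axis=1)
    y_locs = np.argmax(flat_y, axis=1)
    rows = np.arange(N * K)
    max_val_x = flat_x[rows, x_locs]
    max_val_y = flat_y[rows, y_locs]

    # Single vectorized minimum replaces building a boolean mask and
    # conditionally copying (max_val_x[mask] = max_val_y[mask]).
    vals = np.minimum(max_val_x, max_val_y)

    # Pre-allocate locs as float32 and fill it, rather than
    # np.stack(...).astype(np.float32), which builds temporaries.
    locs = np.empty((N * K, 2), dtype=np.float32)
    locs[:, 0] = x_locs
    locs[:, 1] = y_locs
    locs[vals <= 0.0] = -1  # invalidate low-confidence keypoints

    return locs.reshape(N, K, 2), vals.reshape(N, K)
```

The `decode`-side change is a one-line swap (again a sketch; the surrounding signature is assumed):

```python
def decode(simcc_x, simcc_y, simcc_split_ratio):
    keypoints, scores = get_simcc_maximum(simcc_x, simcc_y)
    # Plain division instead of in-place `/=`, avoiding the potential
    # dtype-upcast overhead noted in the explanation above.
    keypoints = keypoints / simcc_split_ratio
    return keypoints, scores
```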
**Performance impact:**

The line profiler shows the most significant gains come from eliminating the expensive `np.amax` operations (originally 20.2% + 11.1% = 31.3% of total time) and the `np.stack` operation (21.7% of total time). The test results demonstrate consistent 35–100% speedups across various input sizes, with particularly strong performance on larger arrays, where the vectorized operations provide the greatest benefit.

This optimization is especially valuable for computer-vision workloads processing pose-estimation data, where these functions are likely called frequently on moderately sized arrays (typical test cases show N×K×W dimensions of 1×1×3 up to 500×2×3).
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, `git checkout codeflash/optimize-decode-mhn4wpxc` and push.