⚡️ Speed up function inference_pose by 74%
#92
📄 74% (0.74x) speedup for `inference_pose` in `invokeai/backend/image_util/dw_openpose/onnxpose.py`

⏱️ Runtime: 3.09 seconds → 1.78 seconds (best of 6 runs)

📝 Explanation and details
The optimization achieves a 73% speedup by eliminating memory allocation overhead and reducing redundant API calls in the image preprocessing pipeline.
Key optimizations:
1. **In-place normalization with precomputed constants:** The original code created new `mean` and `std` arrays 1,779 times per run. The optimized version uses global `_MEAN` and `_STD` constants and performs normalization in-place with `np.subtract()` and `np.divide()`, reducing normalization time from 64.3% to 37.7% of preprocessing time.
2. **Reduced ONNX session overhead:** The original code called `sess.get_outputs()` and built the output list inside the loop for each image. The optimization moves these calls outside the loop, reducing inference overhead from 20% to 14% of session time.
3. **Float32 consistency:** Using `dtype=np.float32` for bounding boxes and intermediate arrays aligns with typical ONNX model expectations, avoiding unnecessary type conversions.
4. **Vectorized postprocessing:** Precomputing `input_size_arr` as a NumPy array enables more efficient broadcasting in keypoint rescaling operations.

Minimal sketches of these patterns are shown after the performance summary below.

**Performance impact:** The optimizations are particularly effective for workloads with many bounding boxes; in the test results, cases with 100+ boxes see 50-90% speedups. Single-bbox cases still benefit from 25-40% improvements due to the eliminated allocations. The optimizations maintain identical mathematical behavior while significantly reducing memory churn in the preprocessing hot path.
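As a rough illustration of the first three points (this is a hedged sketch, not the actual `onnxpose.py` code: the function name `run_pose_model` is hypothetical and the normalization values are assumed ImageNet-style constants), the optimized preprocessing/inference loop looks roughly like this:

```python
import numpy as np
import onnxruntime as ort

# Allocated once at module import instead of on every image (values are an
# assumption: typical ImageNet-style constants used by many pose models).
_MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
_STD = np.array([58.395, 57.12, 57.375], dtype=np.float32)


def run_pose_model(sess: ort.InferenceSession, crops: list) -> list:
    """Hypothetical helper illustrating the optimized structure."""
    # Hoisted out of the per-image loop: query input/output names once.
    input_name = sess.get_inputs()[0].name
    output_names = [out.name for out in sess.get_outputs()]

    results = []
    for crop in crops:
        # Float32 from the start avoids implicit float64 promotion later.
        img = crop.astype(np.float32, copy=True)
        # In-place normalization: no fresh mean/std arrays, no temporaries.
        np.subtract(img, _MEAN, out=img)
        np.divide(img, _STD, out=img)
        # HWC -> NCHW, the layout ONNX pose models commonly expect (assumption).
        blob = np.transpose(img, (2, 0, 1))[np.newaxis, ...]
        results.append(sess.run(output_names, {input_name: blob}))
    return results
```

The vectorized postprocessing point is plain NumPy broadcasting: with a precomputed `input_size_arr`, rescaling all keypoints of a crop back to image coordinates becomes a single expression instead of a per-keypoint Python loop. The shapes and values below are made up for illustration:

```python
import numpy as np

input_size_arr = np.array([192.0, 256.0], dtype=np.float32)            # model input (w, h)
keypoints = np.random.rand(17, 2).astype(np.float32) * input_size_arr  # fake predictions
bbox_wh = np.array([340.0, 480.0], dtype=np.float32)                   # crop size in pixels
bbox_xy = np.array([25.0, 60.0], dtype=np.float32)                     # crop top-left corner

# (17, 2) / (2,) * (2,) + (2,) broadcasts over all keypoints at once.
keypoints_img = keypoints / input_size_arr * bbox_wh + bbox_xy
```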
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, `git checkout codeflash/optimize-inference_pose-mhn54zua` and push.