Conversation

@codeflash-ai codeflash-ai bot commented Nov 6, 2025

📄 58% (0.58x) speedup for decode in invokeai/backend/image_util/dw_openpose/onnxpose.py

⏱️ Runtime : 2.10 milliseconds → 1.33 milliseconds (best of 208 runs)

📝 Explanation and details

The optimized code achieves a **58% speedup** through several key NumPy performance optimizations in the `get_simcc_maximum` function:

**Primary Optimizations:**

1. **Eliminated redundant `np.amax` calls**: The original code called `np.amax` twice to find maximum values after already computing indices with `np.argmax`. The optimized version uses advanced indexing (`array[indices]`) to extract the maximum values directly, eliminating two expensive global reductions.

2. **Replaced masking with `np.minimum`**: Instead of creating a boolean mask and conditionally copying values (`max_val_x[mask] = max_val_y[mask]`), the code now uses `np.minimum(max_val_x, max_val_y)`, which is a single vectorized operation.

3. **Reduced array allocations**: The original `np.stack(...).astype(np.float32)` creates temporary arrays and performs a type conversion. The optimized version pre-allocates `locs` with the correct dtype and fills it directly, avoiding intermediate arrays.

4. **Minor division optimization**: Changed in-place division (`/=`) to regular division (`/`) in the `decode` function to avoid potential dtype upcasting overhead.

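The first three array-level changes above can be illustrated with a minimal sketch. This is a hypothetical reconstruction (the function name is real, but the exact flattening and edge-case handling in `onnxpose.py` may differ), not the file's actual code:

```python
import numpy as np

def get_simcc_maximum_sketch(simcc_x: np.ndarray, simcc_y: np.ndarray):
    """Sketch of the optimized maximum extraction (details assumed).

    simcc_x: (N, K, Wx), simcc_y: (N, K, Wy).
    Returns locs (N, K, 2) float32 and vals (N, K).
    """
    N, K, _ = simcc_x.shape
    # (1) argmax once per axis, then advanced indexing to pull out the
    # maximum values -- no second full-array reduction via np.amax
    x_locs = np.argmax(simcc_x, axis=2)              # (N, K)
    y_locs = np.argmax(simcc_y, axis=2)              # (N, K)
    n_idx, k_idx = np.ogrid[:N, :K]
    max_val_x = simcc_x[n_idx, k_idx, x_locs]        # (N, K)
    max_val_y = simcc_y[n_idx, k_idx, y_locs]        # (N, K)
    # (2) one vectorized op instead of mask-and-copy
    vals = np.minimum(max_val_x, max_val_y)
    # (3) pre-allocate the output instead of np.stack(...).astype(np.float32)
    locs = np.empty((N, K, 2), dtype=np.float32)
    locs[..., 0] = x_locs
    locs[..., 1] = y_locs
    # invalid keypoints (non-positive confidence) are marked with -1
    locs[vals <= 0.0] = -1
    return locs, vals
```

On the all-zeros edge case from the tests, this sketch yields `locs` filled with -1 and `vals` of 0, matching the behavior the generated tests describe.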
**Performance Impact:**
The line profiler shows the most significant gains come from eliminating the expensive `np.amax` operations (originally 20.2% + 11.1% = 31.3% of total time) and the `np.stack` operation (21.7% of total time). The test results demonstrate consistent 35-100% speedups across various input sizes, with particularly strong performance on larger arrays, where the vectorized operations provide maximum benefit.

This optimization is especially valuable for computer vision workloads processing pose estimation data, where these functions are likely called frequently on moderately-sized arrays (typical test cases show N×K×W dimensions of 1×1×3 to 500×2×3).
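The argmax-then-index change behind the largest profiler gain can be checked in isolation. A small, self-contained micro-benchmark sketch (shapes chosen arbitrarily to mirror the test cases; this is not Codeflash's harness):

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
# flatten (N, K, W) to (N*K, W), as a row-wise reduction problem
a = rng.random((500, 2, 64), dtype=np.float32).reshape(-1, 64)

def with_amax():
    # original pattern: two independent full reductions over the data
    idx = np.argmax(a, axis=1)
    return idx, np.amax(a, axis=1)

def with_indexing():
    # optimized pattern: reuse the argmax result via advanced indexing
    idx = np.argmax(a, axis=1)
    return idx, a[np.arange(a.shape[0]), idx]

# both variants must agree on the extracted maxima
assert np.array_equal(with_amax()[1], with_indexing()[1])

print("amax:    ", timeit.timeit(with_amax, number=1000))
print("indexing:", timeit.timeit(with_indexing, number=1000))
```

The indexing variant touches the data once for the reduction instead of twice, which is where the saving comes from on these moderately sized arrays.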

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 30 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import numpy as np
# imports
import pytest
from invokeai.backend.image_util.dw_openpose.onnxpose import decode


# unit tests
class TestDecodeBasic:
    def test_single_instance_single_keypoint(self):
        # Basic: 1 instance, 1 keypoint, Wx=Wy=3, split_ratio=1
        simcc_x = np.array([[[0.1, 0.9, 0.3]]])
        simcc_y = np.array([[[0.2, 0.6, 0.5]]])
        split_ratio = 1
        keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 46.9μs -> 31.6μs (48.4% faster)

    def test_multiple_keypoints(self):
        # Basic: 1 instance, 2 keypoints, Wx=Wy=4, split_ratio=2
        simcc_x = np.array([[[0.1, 0.5, 0.9, 0.3],
                             [0.2, 0.8, 0.1, 0.4]]])
        simcc_y = np.array([[[0.2, 0.6, 0.5, 0.7],
                             [0.9, 0.1, 0.3, 0.2]]])
        split_ratio = 2
        keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 44.0μs -> 29.8μs (47.6% faster)

    def test_multiple_instances(self):
        # Basic: 2 instances, 1 keypoint each, Wx=Wy=3, split_ratio=1
        simcc_x = np.array([[[0.1, 0.5, 0.3]], [[0.2, 0.8, 0.4]]])
        simcc_y = np.array([[[0.2, 0.6, 0.5]], [[0.9, 0.1, 0.3]]])
        split_ratio = 1
        keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 41.0μs -> 29.3μs (39.9% faster)

class TestDecodeEdge:
    def test_all_zeros(self):
        # Edge: All zero input, should set locs to -1 and scores to 0
        simcc_x = np.zeros((1,2,3))
        simcc_y = np.zeros((1,2,3))
        split_ratio = 1
        keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 43.0μs -> 29.6μs (45.1% faster)

    def test_negative_values(self):
        # Edge: Negative values, max still picked, but locs set to -1 if max <= 0
        simcc_x = np.array([[[ -1, -2, -3]]])
        simcc_y = np.array([[[ -4, -5, -6]]])
        split_ratio = 1
        keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 43.9μs -> 30.0μs (46.6% faster)

    def test_maximum_at_multiple_locations(self):
        # Edge: Multiple max values, np.argmax returns first occurrence
        simcc_x = np.array([[[0.5, 0.5, 0.1]]])
        simcc_y = np.array([[[0.2, 0.5, 0.5]]])
        split_ratio = 1
        keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 40.2μs -> 28.2μs (42.4% faster)

    def test_split_ratio_float(self):
        # Edge: split_ratio is float
        simcc_x = np.array([[[0.1, 0.5, 0.3]]])
        simcc_y = np.array([[[0.2, 0.6, 0.5]]])
        split_ratio = 2.0
        keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 39.4μs -> 27.5μs (43.2% faster)

    
#------------------------------------------------
import numpy as np
# imports
import pytest
from invokeai.backend.image_util.dw_openpose.onnxpose import decode

# ------------------- UNIT TESTS -------------------

# 1. Basic Test Cases

def test_decode_basic_single_kp():
    # One keypoint, one instance, Wx=Wy=5, split_ratio=2
    simcc_x = np.array([[[0, 1, 3, 2, 0]]], dtype=np.float32)  # shape (1, 1, 5)
    simcc_y = np.array([[[0, 2, 0, 4, 1]]], dtype=np.float32)  # shape (1, 1, 5)
    split_ratio = 2
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 62.0μs -> 46.1μs (34.6% faster)

def test_decode_basic_multi_kp():
    # Two keypoints, one instance, Wx=Wy=4, split_ratio=1
    simcc_x = np.array([[[0, 2, 1, 0], [1, 0, 0, 3]]], dtype=np.float32)  # (1, 2, 4)
    simcc_y = np.array([[[1, 0, 0, 2], [0, 4, 0, 1]]], dtype=np.float32)  # (1, 2, 4)
    split_ratio = 1
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 48.3μs -> 33.5μs (44.4% faster)

def test_decode_basic_multi_instance():
    # Two instances, one keypoint, Wx=Wy=3, split_ratio=3
    simcc_x = np.array([[[1, 2, 0]], [[0, 1, 3]]], dtype=np.float32)  # (2, 1, 3)
    simcc_y = np.array([[[2, 0, 1]], [[1, 0, 2]]], dtype=np.float32)  # (2, 1, 3)
    split_ratio = 3
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 45.6μs -> 31.4μs (45.2% faster)

# 2. Edge Test Cases

def test_decode_all_zeros():
    # All zeros should result in keypoints -1 and scores 0
    simcc_x = np.zeros((1, 2, 4), dtype=np.float32)
    simcc_y = np.zeros((1, 2, 4), dtype=np.float32)
    split_ratio = 1
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 44.1μs -> 30.7μs (43.4% faster)

def test_decode_negative_values():
    # Negative values, but max is still negative, so scores <= 0
    simcc_x = np.array([[[-1, -2, -3, -4]]], dtype=np.float32)
    simcc_y = np.array([[[-4, -3, -2, -1]]], dtype=np.float32)
    split_ratio = 2
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 42.8μs -> 30.1μs (42.0% faster)

def test_decode_equal_max_values():
    # max_val_x == max_val_y, mask False, vals = max_val_x
    simcc_x = np.array([[[2, 1, 2, 1]]], dtype=np.float32)
    simcc_y = np.array([[[1, 2, 1, 2]]], dtype=np.float32)
    split_ratio = 2
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 41.0μs -> 28.3μs (44.5% faster)

def test_decode_split_ratio_one():
    # split_ratio = 1, keypoints should not change
    simcc_x = np.array([[[0, 0, 5, 0]]], dtype=np.float32)
    simcc_y = np.array([[[0, 7, 0, 0]]], dtype=np.float32)
    split_ratio = 1
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 40.8μs -> 28.2μs (44.4% faster)

def test_decode_split_ratio_float():
    # split_ratio as float, keypoints should be divided by float
    simcc_x = np.array([[[0, 0, 6, 0]]], dtype=np.float32)
    simcc_y = np.array([[[0, 8, 0, 0]]], dtype=np.float32)
    split_ratio = 2.0
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 39.6μs -> 28.2μs (40.4% faster)

def test_decode_large_split_ratio():
    # Large split_ratio, keypoints should be very small
    simcc_x = np.array([[[0, 0, 10, 0]]], dtype=np.float32)
    simcc_y = np.array([[[0, 10, 0, 0]]], dtype=np.float32)
    split_ratio = 1000
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 39.8μs -> 28.2μs (41.1% faster)

def test_decode_non_integer_split_ratio():
    # Non-integer split ratio, e.g. 2.5
    simcc_x = np.array([[[0, 0, 5, 0]]], dtype=np.float32)
    simcc_y = np.array([[[0, 7, 0, 0]]], dtype=np.float32)
    split_ratio = 2.5
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 39.0μs -> 27.4μs (42.2% faster)

def test_decode_multiple_maxima():
    # Multiple maxima, np.argmax returns first occurrence
    simcc_x = np.array([[[3, 3, 1, 0]]], dtype=np.float32)
    simcc_y = np.array([[[2, 2, 0, 0]]], dtype=np.float32)
    split_ratio = 2
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 42.9μs -> 27.7μs (54.5% faster)

def test_decode_differing_last_dim():
    # simcc_x and simcc_y may differ in the last dimension (Wx vs Wy)
    simcc_x = np.zeros((1, 2, 4), dtype=np.float32)
    simcc_y = np.zeros((1, 2, 5), dtype=np.float32)
    split_ratio = 1
    # Should not raise, as only the last dim differs and the code handles it
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 43.7μs -> 29.2μs (49.3% faster)


def test_decode_singleton_arrays():
    # Singleton arrays
    simcc_x = np.array([[[42]]], dtype=np.float32)
    simcc_y = np.array([[[17]]], dtype=np.float32)
    split_ratio = 1
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 66.2μs -> 48.5μs (36.5% faster)

# 3. Large Scale Test Cases

def test_decode_large_n():
    # Large N, K=2, Wx=Wy=3
    N = 500
    K = 2
    Wx = Wy = 3
    simcc_x = np.random.rand(N, K, Wx).astype(np.float32)
    simcc_y = np.random.rand(N, K, Wy).astype(np.float32)
    split_ratio = 2
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 148μs -> 74.2μs (99.9% faster)

def test_decode_large_k():
    # Large K, N=1, Wx=Wy=4
    K = 800
    simcc_x = np.random.rand(1, K, 4).astype(np.float32)
    simcc_y = np.random.rand(1, K, 4).astype(np.float32)
    split_ratio = 2
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 138μs -> 69.9μs (97.5% faster)

def test_decode_large_wx_wy():
    # Large Wx/Wy, N=1, K=1, Wx=Wy=999
    Wx = Wy = 999
    simcc_x = np.zeros((1, 1, Wx), dtype=np.float32)
    simcc_y = np.zeros((1, 1, Wy), dtype=np.float32)
    simcc_x[0, 0, 998] = 1.0
    simcc_y[0, 0, 997] = 2.0
    split_ratio = 3
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 45.4μs -> 31.2μs (45.7% faster)

def test_decode_large_random():
    # Large random arrays, shape (10, 50, 20)
    N, K, Wx, Wy = 10, 50, 20, 20
    simcc_x = np.random.rand(N, K, Wx).astype(np.float32)
    simcc_y = np.random.rand(N, K, Wy).astype(np.float32)
    split_ratio = 5
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 143μs -> 85.1μs (68.2% faster)

def test_decode_performance():
    # Performance test: shape (20, 20, 50)
    N, K, Wx, Wy = 20, 20, 50, 50
    simcc_x = np.random.rand(N, K, Wx).astype(np.float32)
    simcc_y = np.random.rand(N, K, Wy).astype(np.float32)
    split_ratio = 10
    keypoints, scores = decode(simcc_x, simcc_y, split_ratio) # 146μs -> 98.1μs (49.6% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-decode-mhn4wpxc` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 6, 2025 07:58
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 6, 2025