@codeflash-ai codeflash-ai bot commented Nov 6, 2025

📄 74% (0.74x) speedup for inference_pose in invokeai/backend/image_util/dw_openpose/onnxpose.py

⏱️ Runtime: 3.09 seconds → 1.78 seconds (best of 6 runs)

📝 Explanation and details

The optimization achieves a speedup of about 74% (3.09 s down to 1.78 s, best of 6 runs) by eliminating memory allocation overhead in image preprocessing and by removing redundant session API calls from the per-box inference loop.
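For readers reconciling the headline figure with the runtimes, here is a small sketch of the arithmetic. It assumes the percentage is computed as (original - optimized) / optimized, which matches the numbers reported here; the helper name `speedup_percent` is hypothetical.

```python
def speedup_percent(original_s: float, optimized_s: float) -> float:
    """Relative speedup, assuming the (original - optimized) / optimized convention."""
    return (original_s - optimized_s) / optimized_s * 100.0


# Runtimes taken from the PR header (best of 6 runs).
pct = speedup_percent(3.09, 1.78)
print(f"{pct:.1f}%")  # 73.6%, which rounds to the 74% in the title
```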

Key optimizations:

  1. In-place normalization with precomputed constants: The original code created new mean and std arrays 1,779 times per run. The optimized version uses global _MEAN and _STD constants and performs normalization in-place with np.subtract() and np.divide(), reducing the normalization time from 64.3% to 37.7% of preprocessing time.

  2. Reduced ONNX session overhead: The original code called sess.get_outputs() and built the output list inside the loop for each image. The optimization moves these calls outside the loop, reducing inference overhead from 20% to 14% of session time.

  3. Float32 consistency: Using dtype=np.float32 for bounding boxes and intermediate arrays aligns with typical ONNX model expectations, avoiding unnecessary type conversions.

  4. Vectorized postprocessing: Precomputing input_size_arr as a numpy array enables more efficient broadcasting in keypoint rescaling operations.
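The first item can be sketched as follows. This is a minimal illustration, not the code from the PR: `_MEAN` and `_STD` are the names mentioned above, but the numeric values (ImageNet-style statistics) and the two function names are assumptions made for the example.

```python
import numpy as np

# Module-level constants, created once at import time.
# The numeric values are illustrative ImageNet-style statistics (an assumption).
_MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
_STD = np.array([58.395, 57.12, 57.375], dtype=np.float32)


def normalize_alloc(img: np.ndarray) -> np.ndarray:
    """Baseline: rebuilds mean/std on every call and allocates two temporaries."""
    mean = np.array([123.675, 116.28, 103.53])
    std = np.array([58.395, 57.12, 57.375])
    return (img - mean) / std


def normalize_inplace(img: np.ndarray) -> np.ndarray:
    """Reuses the constants and writes both steps into one float32 buffer."""
    out = img.astype(np.float32)  # astype copies by default, so the caller's image is untouched
    np.subtract(out, _MEAN, out=out)
    np.divide(out, _STD, out=out)
    return out


img = np.random.default_rng(0).integers(0, 256, size=(256, 192, 3), dtype=np.uint8)
assert np.allclose(normalize_alloc(img), normalize_inplace(img), atol=1e-4)
```

The in-place variant allocates a single float32 array per crop, whereas the baseline creates the two constant arrays plus a float64 temporary for each arithmetic step.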

Performance impact: The optimizations are particularly effective for workloads with many bounding boxes, as shown in the test results, where cases with 50 or more boxes see roughly 50-90% speedups. Most single-bbox cases still improve by roughly 20-40% due to the eliminated allocations. The optimizations maintain identical mathematical behavior while significantly reducing memory churn in the preprocessing hot path.
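The scaling with box count follows from work that was repeated once per crop. A rough sketch of the hoisting described in item 2, using a hypothetical `CountingSession` stand-in rather than a real ONNX Runtime session; that the input name was also looked up inside the loop is an assumption of this example.

```python
from types import SimpleNamespace

import numpy as np


class CountingSession:
    """Hypothetical stand-in for an ONNX Runtime session that counts metadata lookups."""

    def __init__(self):
        self.metadata_calls = 0

    def get_inputs(self):
        self.metadata_calls += 1
        return [SimpleNamespace(name="input")]

    def get_outputs(self):
        self.metadata_calls += 1
        return [SimpleNamespace(name="simcc_x"), SimpleNamespace(name="simcc_y")]

    def run(self, output_names, input_feed):
        return [np.zeros((1, 3, 4), dtype=np.float32) for _ in output_names]


def infer_per_crop_lookup(sess, crops):
    """Baseline: queries input/output names once per crop."""
    results = []
    for crop in crops:
        feed = {sess.get_inputs()[0].name: crop}
        names = [o.name for o in sess.get_outputs()]
        results.append(sess.run(names, feed))
    return results


def infer_hoisted_lookup(sess, crops):
    """Names are resolved once, before the loop."""
    input_name = sess.get_inputs()[0].name
    names = [o.name for o in sess.get_outputs()]
    return [sess.run(names, {input_name: crop}) for crop in crops]


crops = [np.zeros((1, 3, 256, 192), dtype=np.float32) for _ in range(100)]
slow, fast = CountingSession(), CountingSession()
infer_per_crop_lookup(slow, crops)
infer_hoisted_lookup(fast, crops)
print(slow.metadata_calls, fast.metadata_calls)  # 200 2
```

With 100 crops the baseline performs 200 metadata lookups and the hoisted version performs 2, so the saving grows linearly with the number of boxes.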

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 29 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
import cv2
import numpy as np
import pytest
# Function under test
from invokeai.backend.image_util.dw_openpose.onnxpose import inference_pose


# --- Begin: Dummy ONNX session for testing ---
class DummyInput:
    def __init__(self, name, shape):
        self.name = name
        self.shape = shape

class DummyOutput:
    def __init__(self, name):
        self.name = name

class DummySession:
    def __init__(self, input_shape, num_keypoints=3, wx=5, wy=5):
        self._inputs = [DummyInput("input", input_shape)]
        self._outputs = [DummyOutput("simcc_x"), DummyOutput("simcc_y")]
        self._num_keypoints = num_keypoints
        self._wx = wx
        self._wy = wy
        self._call_count = 0

    def get_inputs(self):
        return self._inputs

    def get_outputs(self):
        return self._outputs

    def run(self, output_names, input_dict):
        # Return deterministic fake simcc_x and simcc_y for test
        # Shape: (1, K, Wx), (1, K, Wy)
        simcc_x = np.zeros((1, self._num_keypoints, self._wx), dtype=np.float32)
        simcc_y = np.zeros((1, self._num_keypoints, self._wy), dtype=np.float32)
        # Set a different max for each keypoint for deterministic test
        for k in range(self._num_keypoints):
            simcc_x[0, k, k % self._wx] = 1.0
            simcc_y[0, k, (self._num_keypoints - k - 1) % self._wy] = 0.5
        return [simcc_x, simcc_y]
# --- End: Dummy ONNX session ---

# --- Begin: Unit tests ---
# 1. Basic Test Cases

def test_basic_single_person_bbox():
    """
    Test inference_pose with a single bounding box and a simple image.
    """
    img = np.ones((256, 192, 3), dtype=np.uint8) * 128  # uniform gray image
    bbox = [[10, 20, 110, 220]]  # one bbox
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=3, wx=5, wy=5)
    keypoints, scores = inference_pose(session, bbox, img) # 1.21ms -> 920μs (31.5% faster)
    # Keypoints and scores should be deterministic
    # For simcc_x: max at 0,1,2; for simcc_y: max at 2,1,0 (see DummySession)
    expected_x = np.array([0, 1, 2], dtype=np.float32) / 2.0  # simcc_split_ratio=2.0
    expected_y = np.array([2, 1, 0], dtype=np.float32) / 2.0
    for k in range(3):
        pass

def test_basic_no_bbox():
    """
    Test inference_pose with empty bbox (should default to whole image).
    """
    img = np.zeros((256, 192, 3), dtype=np.uint8)
    bbox = []
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=2, wx=4, wy=4)
    keypoints, scores = inference_pose(session, bbox, img) # 1.19ms -> 859μs (38.5% faster)

def test_basic_multiple_bboxes():
    """
    Test inference_pose with multiple bounding boxes (multiple people).
    """
    img = np.random.randint(0, 255, (256, 192, 3), dtype=np.uint8)
    bbox = [
        [0, 0, 50, 100],
        [60, 80, 120, 200]
    ]
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=4, wx=6, wy=6)
    keypoints, scores = inference_pose(session, bbox, img) # 2.25ms -> 1.65ms (36.2% faster)

# 2. Edge Test Cases

def test_edge_empty_image():
    """
    Test with an empty image (all zeros).
    """
    img = np.zeros((256, 192, 3), dtype=np.uint8)
    bbox = [[0, 0, 192, 256]]
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=1, wx=3, wy=3)
    keypoints, scores = inference_pose(session, bbox, img) # 1.15ms -> 854μs (34.2% faster)

def test_edge_bbox_out_of_bounds():
    """
    Bounding box is partially or fully outside the image.
    """
    img = np.ones((256, 192, 3), dtype=np.uint8)
    bbox = [
        [-20, -20, 50, 50],  # partially out
        [190, 250, 300, 400]  # fully out
    ]
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=2, wx=4, wy=4)
    keypoints, scores = inference_pose(session, bbox, img) # 2.20ms -> 1.59ms (37.9% faster)

def test_edge_minimal_bbox():
    """
    Very small bounding box (single pixel).
    """
    img = np.ones((256, 192, 3), dtype=np.uint8)
    bbox = [[10, 10, 11, 11]]
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=1, wx=2, wy=2)
    keypoints, scores = inference_pose(session, bbox, img) # 1.14ms -> 865μs (32.0% faster)

def test_edge_non_integer_bbox():
    """
    Bounding box with float coordinates.
    """
    img = np.ones((256, 192, 3), dtype=np.uint8)
    bbox = [[10.5, 20.2, 110.8, 220.9]]
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=2, wx=3, wy=3)
    keypoints, scores = inference_pose(session, bbox, img) # 1.14ms -> 840μs (35.5% faster)

def test_edge_different_image_size():
    """
    Image size not matching model input size.
    """
    img = np.ones((300, 400, 3), dtype=np.uint8)
    bbox = [[50, 60, 350, 290]]
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=3, wx=5, wy=5)
    keypoints, scores = inference_pose(session, bbox, img) # 1.18ms -> 852μs (38.0% faster)

def test_edge_large_bbox():
    """
    Bounding box larger than image.
    """
    img = np.ones((256, 192, 3), dtype=np.uint8)
    bbox = [[-100, -100, 400, 500]]
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=2, wx=3, wy=3)
    keypoints, scores = inference_pose(session, bbox, img) # 1.11ms -> 811μs (37.3% faster)

def test_edge_zero_area_bbox():
    """
    Bounding box with zero area (x0 == x1 and/or y0 == y1).
    """
    img = np.ones((256, 192, 3), dtype=np.uint8)
    bbox = [[50, 50, 50, 50], [80, 90, 120, 90]]  # zero width, zero height
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=1, wx=2, wy=2)
    keypoints, scores = inference_pose(session, bbox, img) # 2.21ms -> 1.64ms (34.3% faster)

# 3. Large Scale Test Cases

def test_large_many_bboxes():
    """
    Test with a large number of bounding boxes (up to 1000).
    """
    img = np.random.randint(0, 255, (256, 192, 3), dtype=np.uint8)
    num_boxes = 1000
    bbox = [[i, i, i+10, i+20] for i in range(num_boxes)]
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=1, wx=2, wy=2)
    keypoints, scores = inference_pose(session, bbox, img) # 1.45s -> 750ms (92.7% faster)

def test_large_image():
    """
    Test with a large image (max 1000x1000).
    """
    img = np.ones((1000, 1000, 3), dtype=np.uint8)
    bbox = [[100, 100, 900, 900]]
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=5, wx=5, wy=5)
    keypoints, scores = inference_pose(session, bbox, img) # 1.34ms -> 1.02ms (31.1% faster)

def test_large_keypoints():
    """
    Test with a large number of keypoints (e.g. 100).
    """
    img = np.ones((256, 192, 3), dtype=np.uint8)
    bbox = [[0, 0, 192, 256]]
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=100, wx=10, wy=10)
    keypoints, scores = inference_pose(session, bbox, img) # 1.22ms -> 915μs (33.3% faster)

def test_large_keypoints_and_bboxes():
    """
    Test with both many keypoints and many bboxes (stress test).
    """
    img = np.ones((256, 192, 3), dtype=np.uint8)
    num_boxes = 100
    num_keypoints = 50
    bbox = [[i, i, i+10, i+20] for i in range(num_boxes)]
    session = DummySession(input_shape=(1, 3, 256, 192), num_keypoints=num_keypoints, wx=10, wy=10)
    keypoints, scores = inference_pose(session, bbox, img) # 142ms -> 79.7ms (78.5% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import numpy as np
# imports
import pytest
from invokeai.backend.image_util.dw_openpose.onnxpose import inference_pose


# --- Dummy ONNX session for testing (the function under test is imported above) ---
class DummyInput:
    def __init__(self, name, shape):
        self.name = name
        self.shape = shape

class DummyOutput:
    def __init__(self, name):
        self.name = name

class DummySession:
    def __init__(self, input_shape, output_shapes, output_names, output_values=None):
        self._inputs = [DummyInput("input", input_shape)]
        self._outputs = [DummyOutput(n) for n in output_names]
        self.output_shapes = output_shapes
        self.output_names = output_names
        self.output_values = output_values  # Optional: list of outputs to return (for edge case control)

    def get_inputs(self):
        return self._inputs

    def get_outputs(self):
        return self._outputs

    def run(self, output_names, input_dict):
        # Return dummy outputs as a list [simcc_x, simcc_y], matching the
        # order of output_names; shapes are (1, K, Wx) and (1, K, Wy).
        # Simple patterns are used so the argmax positions are known.
        if self.output_values is not None:
            return self.output_values
        # Default: output_names is ["simcc_x", "simcc_y"]
        K, Wx, Wy = 3, 10, 10
        simcc_x = np.zeros((1, K, Wx), dtype=np.float32)
        simcc_y = np.zeros((1, K, Wy), dtype=np.float32)
        # Set a max at index 5 for all keypoints
        simcc_x[0, :, 5] = 1.0
        simcc_y[0, :, 7] = 1.0
        return [simcc_x, simcc_y]

# ------------------- UNIT TESTS -------------------
# 1. Basic Test Cases

def make_dummy_img(h, w, c=3, v=128):
    """Helper to create a dummy image of shape (h, w, c) with value v."""
    return np.full((h, w, c), v, dtype=np.float32)

def test_basic_single_bbox():
    """Test with a single bounding box on a standard image."""
    img = make_dummy_img(256, 192)
    bbox = [[10, 20, 100, 120]]
    sess = DummySession(input_shape=(1, 3, 256, 192), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 1.29ms -> 927μs (39.4% faster)

def test_basic_no_bbox():
    """Test with no bounding box (should use whole image)."""
    img = make_dummy_img(256, 192)
    bbox = []
    sess = DummySession(input_shape=(1, 3, 256, 192), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 1.57ms -> 1.24ms (26.0% faster)

def test_basic_multiple_bboxes():
    """Test with multiple bounding boxes."""
    img = make_dummy_img(256, 192)
    bbox = [
        [0, 0, 50, 50],
        [60, 70, 120, 150]
    ]
    sess = DummySession(input_shape=(1, 3, 256, 192), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 2.64ms -> 1.94ms (36.1% faster)

def test_basic_different_image_shape():
    """Test with a non-square image and a single bbox."""
    img = make_dummy_img(300, 100)
    bbox = [[10, 10, 90, 290]]
    sess = DummySession(input_shape=(1, 3, 300, 100), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 978μs -> 761μs (28.5% faster)

# 2. Edge Test Cases

def test_empty_image():
    """Test with an empty image (all zeros)."""
    img = np.zeros((256, 192, 3), dtype=np.float32)
    bbox = [[0, 0, 192, 256]]
    sess = DummySession(input_shape=(1, 3, 256, 192), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 1.58ms -> 1.26ms (25.3% faster)

def test_bbox_outside_image():
    """Test with a bbox that is partially outside the image."""
    img = make_dummy_img(256, 192)
    bbox = [[-10, -10, 300, 300]]
    sess = DummySession(input_shape=(1, 3, 256, 192), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 1.99ms -> 1.64ms (21.1% faster)

def test_empty_bbox_list_and_empty_image():
    """Test with both empty bbox list and empty image."""
    img = np.zeros((256, 192, 3), dtype=np.float32)
    bbox = []
    sess = DummySession(input_shape=(1, 3, 256, 192), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 1.57ms -> 1.23ms (26.8% faster)

def test_bbox_zero_area():
    """Test with a bbox with zero area (x0==x1 or y0==y1)."""
    img = make_dummy_img(256, 192)
    bbox = [[50, 50, 50, 100], [30, 40, 70, 40]]
    sess = DummySession(input_shape=(1, 3, 256, 192), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 2.35ms -> 1.67ms (40.8% faster)

def test_non_integer_bbox():
    """Test with float bbox coordinates."""
    img = make_dummy_img(256, 192)
    bbox = [[10.5, 20.7, 100.2, 120.9]]
    sess = DummySession(input_shape=(1, 3, 256, 192), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 1.25ms -> 910μs (37.7% faster)

def test_bbox_minimal_size():
    """Test with a bbox of minimal size (1x1)."""
    img = make_dummy_img(256, 192)
    bbox = [[10, 10, 11, 11]]
    sess = DummySession(input_shape=(1, 3, 256, 192), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 1.20ms -> 868μs (38.6% faster)

def test_large_float_values():
    """Test with large float values in image."""
    img = np.full((256, 192, 3), 1e6, dtype=np.float32)
    bbox = [[0, 0, 192, 256]]
    sess = DummySession(input_shape=(1, 3, 256, 192), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 1.57ms -> 1.22ms (28.3% faster)

# 3. Large Scale Test Cases

def test_many_bboxes():
    """Test with a large number of bounding boxes."""
    img = make_dummy_img(256, 192)
    N = 100  # Keep under 1000 for performance
    bbox = [[i, i, i+10, i+20] for i in range(N)]
    sess = DummySession(input_shape=(1, 3, 256, 192), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 141ms -> 76.7ms (84.5% faster)

def test_large_image():
    """Test with a large image size."""
    img = make_dummy_img(512, 384)
    bbox = [[0, 0, 384, 512]]
    sess = DummySession(input_shape=(1, 3, 512, 384), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 7.38ms -> 4.16ms (77.5% faster)

def test_large_image_many_bboxes():
    """Test with a large image and many bboxes."""
    img = make_dummy_img(512, 384)
    N = 50
    bbox = [[i*2, i*2, i*2+10, i*2+20] for i in range(N)]
    sess = DummySession(input_shape=(1, 3, 512, 384), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 258ms -> 149ms (73.1% faster)

def test_performance_large_batch():
    """Performance test: process 500 bboxes (should not timeout)."""
    img = make_dummy_img(256, 192)
    N = 500
    bbox = [[i, i, i+10, i+20] for i in range(N)]
    sess = DummySession(input_shape=(1, 3, 256, 192), output_shapes=[(1, 3, 10), (1, 3, 10)], output_names=["simcc_x", "simcc_y"])
    keypoints, scores = inference_pose(sess, bbox, img) # 1.06s -> 694ms (52.7% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
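
The comments in the first generated test describe the expected keypoint positions as the argmax of each SimCC row divided by `simcc_split_ratio=2.0`. A minimal sketch of that decoding step is below, using the same pattern as the first `DummySession`. The function name `decode_simcc` and the score rule (the smaller of the two per-axis peaks) are assumptions for illustration, and the coordinates are in model-input space, before any rescaling back to the image.

```python
import numpy as np


def decode_simcc(simcc_x, simcc_y, split_ratio=2.0):
    """Argmax decoding of SimCC outputs of shape (N, K, Wx) and (N, K, Wy)."""
    x_idx = np.argmax(simcc_x, axis=-1)
    y_idx = np.argmax(simcc_y, axis=-1)
    locs = np.stack([x_idx, y_idx], axis=-1).astype(np.float32) / split_ratio
    # Score: the smaller of the two per-axis peak values (an assumption).
    scores = np.minimum(simcc_x.max(axis=-1), simcc_y.max(axis=-1))
    return locs, scores


# Same deterministic pattern as the first DummySession (K=3, Wx=Wy=5).
K, W = 3, 5
simcc_x = np.zeros((1, K, W), dtype=np.float32)
simcc_y = np.zeros((1, K, W), dtype=np.float32)
for k in range(K):
    simcc_x[0, k, k % W] = 1.0
    simcc_y[0, k, (K - k - 1) % W] = 0.5

locs, scores = decode_simcc(simcc_x, simcc_y)
print(locs[0])    # x peaks at 0, 1, 2 and y peaks at 2, 1, 0, each divided by 2.0
print(scores[0])  # 0.5 for every keypoint under the assumed score rule
```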

To edit these changes, run `git checkout codeflash/optimize-inference_pose-mhn54zua`, make your changes, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 6, 2025 08:04
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Nov 6, 2025