@codeflash-ai codeflash-ai bot commented Nov 6, 2025

📄 366% (3.66x) speedup for remove_pattern in invokeai/app/util/controlnet_utils.py

⏱️ Runtime : 22.2 milliseconds -> 4.76 milliseconds (best of 152 runs)
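
The headline percentage can be reproduced from the two runtimes, assuming it is defined as (original / optimized - 1) x 100. That definition is inferred from the numbers in this report rather than stated in it, and `percent_faster` is a hypothetical helper used only for illustration.

```python
def percent_faster(original_ms, optimized_ms):
    """Percent speedup, assuming the (original / optimized - 1) * 100 definition."""
    return (original_ms / optimized_ms - 1.0) * 100.0


# Runtimes taken from the report's summary line.
headline = percent_faster(22.2, 4.76)

# Plain ratio of the two runtimes, for comparison with the percentage.
ratio = 22.2 / 4.76

print(f"{headline:.0f}% faster, {ratio:.2f}x as fast")
```

Under this reading, the "3.66x" in the title is the same quantity as the 366% figure written as a multiple, while the optimized function runs in roughly 1/4.66 of the original time.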

📝 Explanation and details

The optimized code achieves a 366% speedup by eliminating expensive NumPy array operations and replacing them with more efficient alternatives.

Key Optimizations:

  1. Eliminated np.where() for indexing: The original code used objects = np.where(objects > 127) which creates a tuple of index arrays. This is expensive because np.where must scan the entire array and build coordinate arrays for all matching elements. The optimized version uses direct boolean masking with mask = objects > 127, which is much faster.

  2. Replaced tuple introspection with direct counting: Instead of checking objects[0].shape[0] > 0 to determine if any patterns were found, the optimization uses np.count_nonzero(mask) which directly counts True values in the boolean mask. This avoids the overhead of creating index arrays entirely.

  3. Conditional assignment optimization: The optimized version only performs the expensive array assignment x[mask] = 0 when patterns are actually found (if count:), avoiding unnecessary work in cases where no patterns match.
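
The three changes above can be sketched in isolation from OpenCV. This is a minimal illustration, not the PR's actual diff: `zero_hits_where` and `zero_hits_mask` are hypothetical names, and `response` stands in for the array that the hit-or-miss step produces. Only the `np.where`, boolean-mask, and `np.count_nonzero` expressions are taken from the description above.

```python
import numpy as np


def zero_hits_where(x, response):
    """Original style: build index arrays with np.where, then assign."""
    hits = np.where(response > 127)   # tuple of coordinate arrays, one per axis
    x[hits] = 0                       # executed even when there are no hits
    return x, hits[0].shape[0] > 0


def zero_hits_mask(x, response):
    """Optimized style: boolean mask plus count_nonzero, assign only on hits."""
    mask = response > 127             # boolean array with the shape of response
    count = np.count_nonzero(mask)
    if count:
        x[mask] = 0
    return x, count > 0


# Hypothetical 4x4 input with a single "hit" at row 1, column 2.
image = np.full((4, 4), 255, dtype=np.uint8)
response = np.zeros((4, 4), dtype=np.uint8)
response[1, 2] = 255

out_a, found_a = zero_hits_where(image.copy(), response)
out_b, found_b = zero_hits_mask(image.copy(), response)
```

Both variants modify `x` in place and return the same `(array, bool)` pair, which is the property the regression tests below rely on.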

Performance Impact Analysis:

  • The line profiler shows the original np.where() call took 68.4% of total execution time (19.5ms), while the optimized boolean mask creation takes only 15.4% (0.88ms)
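  • Array assignment's share of execution time is essentially unchanged (20.7% -> 20.1%), but it is now a share of a much smaller total, and the assignment only executes when a pattern is found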
  • The optimization is particularly effective for large-scale test cases, showing 280-935% speedups on 1000x1000 images

Best Performance Gains:

  • Large images with no matching patterns: 405% faster (avoids expensive indexing entirely)
  • Dense pattern matching: 935% faster (boolean operations scale better than coordinate generation)
  • Smaller images show more modest gains, mostly in the 10-45% range; the two tests that expect cv2.error improve by only 1-5%
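
A rough way to explore the two extremes listed above (no matches and dense matches) on synthetic data is sketched here. It is an assumption-laden harness, not the benchmark Codeflash ran: `with_where`, `with_mask`, and `time_case` are hypothetical names, the `response` arrays are hand-built instead of coming from `cv2.morphologyEx`, and the absolute numbers will differ from the report because the OpenCV call is excluded.

```python
import timeit

import numpy as np


def with_where(x, response):
    hits = np.where(response > 127)
    x[hits] = 0
    return x, hits[0].shape[0] > 0


def with_mask(x, response):
    mask = response > 127
    count = np.count_nonzero(mask)
    if count:
        x[mask] = 0
    return x, count > 0


def time_case(response, repeats=5):
    """Return best-of-N seconds per variant, each run on a fresh image copy."""
    image = np.full(response.shape, 255, dtype=np.uint8)
    timings = {}
    for name, fn in (("where", with_where), ("mask", with_mask)):
        timings[name] = min(
            timeit.repeat(lambda: fn(image.copy(), response), number=1, repeat=repeats)
        )
    return timings


# Two synthetic extremes: no hits at all, and every pixel a hit.
no_hits = np.zeros((1000, 1000), dtype=np.uint8)
all_hits = np.full((1000, 1000), 255, dtype=np.uint8)

for label, response in (("no hits", no_hits), ("all hits", all_hits)):
    t = time_case(response)
    print(f"{label}: where={t['where'] * 1e3:.2f} ms, mask={t['mask'] * 1e3:.2f} ms")
```

Each timing includes the cost of `image.copy()`, which is the same for both variants, so the measured ratio understates the gap between the indexing steps themselves.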

The optimization maintains identical behavior while dramatically reducing computational overhead through more efficient NumPy operations.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 37 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
import cv2
import numpy as np
# imports
import pytest
from invokeai.app.util.controlnet_utils import remove_pattern

# unit tests

# --- Basic Test Cases ---

def test_basic_single_pattern_removal():
    # Create a simple 5x5 image with a cross pattern in the center
    img = np.zeros((5,5), dtype=np.uint8)
    img[2,1:4] = 255
    img[1:4,2] = 255
    # Cross-shaped kernel
    kernel = np.array([[0,1,0],[1,1,1],[0,1,0]], dtype=np.uint8)
    # Copy for comparison
    img_copy = img.copy()
    result, found = remove_pattern(img_copy, kernel) # 43.0μs -> 39.1μs (10.1% faster)

def test_basic_no_pattern_found():
    # Image with no matching pattern
    img = np.zeros((5,5), dtype=np.uint8)
    kernel = np.ones((3,3), dtype=np.uint8)
    img_copy = img.copy()
    result, found = remove_pattern(img_copy, kernel) # 32.8μs -> 25.8μs (27.0% faster)

def test_basic_multiple_patterns():
    # Image with two cross patterns
    img = np.zeros((7,7), dtype=np.uint8)
    # Place two crosses
    img[2,1:4] = 255; img[1:4,2] = 255
    img[4,3:6] = 255; img[3:6,4] = 255
    kernel = np.array([[0,1,0],[1,1,1],[0,1,0]], dtype=np.uint8)
    img_copy = img.copy()
    result, found = remove_pattern(img_copy, kernel) # 29.9μs -> 25.9μs (15.4% faster)

# --- Edge Test Cases ---


def test_edge_kernel_larger_than_image():
    # Image smaller than kernel
    img = np.ones((2,2), dtype=np.uint8) * 255
    kernel = np.ones((3,3), dtype=np.uint8)
    img_copy = img.copy()
    result, found = remove_pattern(img_copy, kernel) # 36.0μs -> 31.9μs (12.8% faster)

def test_edge_all_zeros_image():
    # All zeros image
    img = np.zeros((10,10), dtype=np.uint8)
    kernel = np.ones((3,3), dtype=np.uint8)
    img_copy = img.copy()
    result, found = remove_pattern(img_copy, kernel) # 30.0μs -> 22.8μs (31.4% faster)

def test_edge_all_ones_image():
    # All ones image (255)
    img = np.ones((10,10), dtype=np.uint8) * 255
    kernel = np.ones((3,3), dtype=np.uint8)
    img_copy = img.copy()
    result, found = remove_pattern(img_copy, kernel) # 25.6μs -> 19.9μs (28.9% faster)
    # Only the center region should be affected
    changed_pixels = np.sum(result != img)

def test_edge_non_square_image_and_kernel():
    # Non-square image and kernel
    img = np.zeros((5,10), dtype=np.uint8)
    img[2,4:7] = 255
    kernel = np.array([[1,1,1]], dtype=np.uint8)
    img_copy = img.copy()
    result, found = remove_pattern(img_copy, kernel) # 28.3μs -> 23.7μs (19.7% faster)

def test_edge_dtype_variation():
    # Test with float32 dtype (should fail, as OpenCV expects uint8)
    img = np.ones((5,5), dtype=np.float32) * 255
    kernel = np.ones((3,3), dtype=np.uint8)
    with pytest.raises(cv2.error):
        remove_pattern(img, kernel) # 30.0μs -> 29.6μs (1.31% faster)

def test_edge_kernel_with_zeroes():
    # Kernel with zeroes (should only match certain patterns)
    img = np.zeros((5,5), dtype=np.uint8)
    img[2,2] = 255
    kernel = np.array([[0,0,0],[0,1,0],[0,0,0]], dtype=np.uint8)
    img_copy = img.copy()
    result, found = remove_pattern(img_copy, kernel) # 34.0μs -> 30.0μs (13.2% faster)

def test_edge_image_with_negative_values():
    # Image with negative values (simulate via int16)
    img = np.zeros((5,5), dtype=np.int16)
    img[2,2] = -10
    kernel = np.ones((3,3), dtype=np.uint8)
    # Should raise error due to wrong dtype
    with pytest.raises(cv2.error):
        remove_pattern(img, kernel) # 25.8μs -> 24.6μs (5.11% faster)

# --- Large Scale Test Cases ---

def test_large_scale_no_pattern():
    # Large image, no pattern
    img = np.zeros((1000,1000), dtype=np.uint8)
    kernel = np.ones((3,3), dtype=np.uint8)
    img_copy = img.copy()
    result, found = remove_pattern(img_copy, kernel) # 1.92ms -> 379μs (405% faster)

def test_large_scale_many_patterns():
    # Large image with many patterns
    img = np.zeros((1000,1000), dtype=np.uint8)
    # Place patterns every 10 pixels
    for i in range(10, 990, 10):
        for j in range(10, 990, 10):
            img[i-1:i+2, j-1:j+2] = 255
    kernel = np.ones((3,3), dtype=np.uint8)
    img_copy = img.copy()
    result, found = remove_pattern(img_copy, kernel) # 2.08ms -> 546μs (280% faster)
    # Check that many pixels were removed
    removed_pixels = np.sum(result != img)

def test_large_scale_performance():
    # Large image to check performance (should not timeout)
    img = np.random.randint(0, 2, size=(1000,1000), dtype=np.uint8) * 255
    kernel = np.ones((3,3), dtype=np.uint8)
    img_copy = img.copy()
    result, found = remove_pattern(img_copy, kernel) # 1.97ms -> 537μs (266% faster)

def test_large_scale_edge_of_image():
    # Patterns at the edge should not be removed (since kernel cannot fit)
    img = np.zeros((1000,1000), dtype=np.uint8)
    img[0:3,0:3] = 255  # Top-left corner
    kernel = np.ones((3,3), dtype=np.uint8)
    img_copy = img.copy()
    result, found = remove_pattern(img_copy, kernel) # 1.93ms -> 490μs (293% faster)

# --- Additional Robustness Tests ---

def test_pattern_removal_is_inplace():
    # Ensure function modifies the input array inplace
    img = np.zeros((5,5), dtype=np.uint8)
    img[2,2] = 255
    kernel = np.array([[0,0,0],[0,1,0],[0,0,0]], dtype=np.uint8)
    result, found = remove_pattern(img, kernel) # 40.1μs -> 35.9μs (11.7% faster)

def test_return_types():
    # Ensure function returns (ndarray, bool)
    img = np.zeros((5,5), dtype=np.uint8)
    kernel = np.ones((3,3), dtype=np.uint8)
    result, found = remove_pattern(img, kernel) # 27.7μs -> 22.4μs (23.7% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import cv2
import numpy as np
# imports
import pytest
from invokeai.app.util.controlnet_utils import remove_pattern

# ------------------- UNIT TESTS -------------------

# ----------- BASIC TEST CASES -----------

def test_remove_pattern_single_match():
    # Simple 3x3 array with a single matching pattern in the center
    img = np.array([
        [0,0,0],
        [0,255,0],
        [0,0,0]
    ], dtype=np.uint8)
    kernel = np.array([
        [0,0,0],
        [0,1,0],
        [0,0,0]
    ], dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 29.1μs -> 25.1μs (15.9% faster)

def test_remove_pattern_no_match():
    # No pattern matches
    img = np.zeros((3,3), dtype=np.uint8)
    kernel = np.array([
        [0,1,0],
        [1,1,1],
        [0,1,0]
    ], dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 26.7μs -> 21.4μs (24.4% faster)

def test_remove_pattern_multiple_matches():
    # Multiple matching patterns
    img = np.array([
        [255,0,255],
        [0,255,0],
        [255,0,255]
    ], dtype=np.uint8)
    kernel = np.array([
        [1,0,1],
        [0,1,0],
        [1,0,1]
    ], dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 26.8μs -> 23.1μs (16.3% faster)

def test_remove_pattern_partial_match():
    # Only part of the kernel matches
    img = np.array([
        [255,255,255],
        [255,0,255],
        [255,255,255]
    ], dtype=np.uint8)
    kernel = np.array([
        [1,1,1],
        [1,0,1],
        [1,1,1]
    ], dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 26.8μs -> 22.2μs (20.8% faster)

# ----------- EDGE TEST CASES -----------


def test_remove_pattern_single_pixel_image():
    # Single pixel image, kernel matches
    img = np.array([[255]], dtype=np.uint8)
    kernel = np.array([[1]], dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 42.6μs -> 38.7μs (10.0% faster)

def test_remove_pattern_kernel_larger_than_image():
    # Kernel is larger than the image
    img = np.array([[255]], dtype=np.uint8)
    kernel = np.ones((3,3), dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 34.4μs -> 29.7μs (16.0% faster)

def test_remove_pattern_non_binary_image():
    # Image contains values other than 0 and 255
    img = np.array([
        [100,150,200],
        [255,0,255],
        [50, 0, 255]
    ], dtype=np.uint8)
    kernel = np.array([
        [0,0,0],
        [0,1,0],
        [0,0,0]
    ], dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 31.1μs -> 26.7μs (16.6% faster)

def test_remove_pattern_all_zeros_image():
    # All zeros image, should not match any pattern except all-zero kernel
    img = np.zeros((5,5), dtype=np.uint8)
    kernel = np.ones((3,3), dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 29.3μs -> 22.6μs (29.5% faster)

def test_remove_pattern_all_ones_kernel():
    # All-ones kernel, only matches if image is all 255 in a region
    img = np.full((5,5), 255, dtype=np.uint8)
    kernel = np.ones((3,3), dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 28.4μs -> 23.3μs (21.7% faster)

def test_remove_pattern_kernel_with_negative_ones():
    # Kernel with -1, which in cv2's hit-or-miss transform requires a background pixel (0 is the "don't care" value)
    img = np.array([
        [255,255,255],
        [255,0,255],
        [255,255,255]
    ], dtype=np.uint8)
    kernel = np.array([
        [-1, 1, -1],
        [ 1, 0,  1],
        [-1, 1, -1]
    ], dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 30.5μs -> 26.4μs (15.8% faster)

def test_remove_pattern_non_square_kernel():
    # Non-square kernel
    img = np.array([
        [0,255,0,255],
        [255,0,255,0],
        [0,255,0,255]
    ], dtype=np.uint8)
    kernel = np.array([
        [0,1,0,1],
        [1,0,1,0],
        [0,1,0,1]
    ], dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 26.6μs -> 22.7μs (16.9% faster)

def test_remove_pattern_kernel_all_zeros():
    # Kernel is all zeros (should not match anything)
    img = np.full((3,3), 255, dtype=np.uint8)
    kernel = np.zeros((3,3), dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 19.9μs -> 16.6μs (19.6% faster)

# ----------- LARGE SCALE TEST CASES -----------

def test_remove_pattern_large_image_sparse_pattern():
    # Large image with a single matching pattern
    img = np.zeros((1000,1000), dtype=np.uint8)
    img[500,500] = 255
    kernel = np.array([
        [0,0,0],
        [0,1,0],
        [0,0,0]
    ], dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 1.88ms -> 436μs (330% faster)

def test_remove_pattern_large_image_dense_pattern():
    # Large image filled with 255, should remove all inner pixels
    img = np.full((1000,1000), 255, dtype=np.uint8)
    kernel = np.ones((3,3), dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 7.61ms -> 735μs (935% faster)

def test_remove_pattern_large_image_no_match():
    # Large image with no matching pattern
    img = np.zeros((1000,1000), dtype=np.uint8)
    kernel = np.ones((3,3), dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 1.91ms -> 377μs (406% faster)

def test_remove_pattern_large_kernel():
    # Large kernel on a large image, only one match
    img = np.zeros((100,100), dtype=np.uint8)
    img[50:60, 50:60] = 255
    kernel = np.ones((10,10), dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 62.0μs -> 42.8μs (44.9% faster)

def test_remove_pattern_large_image_multiple_matches():
    # Large image with multiple, non-overlapping matching blocks
    img = np.zeros((1000,1000), dtype=np.uint8)
    for i in range(0,1000,100):
        for j in range(0,1000,100):
            img[i+1:i+4, j+1:j+4] = 255
    kernel = np.ones((3,3), dtype=np.int8)
    result, found = remove_pattern(img.copy(), kernel) # 1.98ms -> 489μs (305% faster)
    # The center pixel of each 3x3 block should be zeroed (hit-or-miss marks only the kernel's anchor position)
    for i in range(0,1000,100):
        for j in range(0,1000,100):
            pass

# ----------- ADDITIONAL EDGE CASES -----------

def test_remove_pattern_image_dtype_int32():
    # Image with dtype int32 (should still work)
    img = np.full((3,3), 255, dtype=np.int32)
    kernel = np.ones((3,3), dtype=np.int8)
    # cv2 expects uint8, so we cast
    result, found = remove_pattern(img.astype(np.uint8), kernel) # 37.9μs -> 33.0μs (14.7% faster)

def test_remove_pattern_kernel_dtype_float():
    # Kernel with dtype float (should cast to int8)
    img = np.full((3,3), 255, dtype=np.uint8)
    kernel = np.ones((3,3), dtype=np.float32)
    result, found = remove_pattern(img.copy(), kernel.astype(np.int8)) # 27.2μs -> 22.4μs (21.4% faster)

def test_remove_pattern_image_with_nan():
    # Image with NaN values (should treat as zero)
    img = np.zeros((3,3), dtype=np.float32)
    img[1,1] = np.nan
    kernel = np.ones((3,3), dtype=np.int8)
    # cv2 can't handle NaN, so we fill NaN with 0
    img_no_nan = np.nan_to_num(img, nan=0).astype(np.uint8)
    result, found = remove_pattern(img_no_nan, kernel) # 26.1μs -> 19.1μs (37.0% faster)

def test_remove_pattern_image_with_negative_values():
    # Image with negative values (should treat as zero)
    img = np.array([
        [0, -1, 0],
        [-1, 255, -1],
        [0, -1, 0]
    ], dtype=np.int16)
    kernel = np.array([
        [0,0,0],
        [0,1,0],
        [0,0,0]
    ], dtype=np.int8)
    # cv2 expects uint8, so we clip and cast
    img_uint8 = np.clip(img, 0, 255).astype(np.uint8)
    result, found = remove_pattern(img_uint8, kernel) # 26.5μs -> 22.2μs (19.1% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-remove_pattern-mhn8bgfb, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 6, 2025 09:33
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 6, 2025