
@codeflash-ai codeflash-ai bot commented Nov 6, 2025

📄 56% (0.56x) speedup for XLabsControlNetExtension._xlabs_output_to_controlnet_output in invokeai/backend/flux/extensions/xlabs_controlnet_extension.py

⏱️ Runtime : 92.2 microseconds → 59.0 microseconds (best of 215 runs)
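
For reference, the headline percentage can be reproduced from the two runtimes above. The sketch below assumes the speedup is computed as (old / new - 1); the helper name `speedup_percent` is hypothetical and not part of Codeflash.

```python
def speedup_percent(before: float, after: float) -> float:
    """Relative speedup in percent, assuming the (before / after - 1) convention."""
    return (before / after - 1.0) * 100.0


# Runtimes quoted in this PR, in microseconds.
headline = speedup_percent(92.2, 59.0)
print(f"{headline:.1f}%")  # prints 56.3%
```

Under this assumption, 92.2 μs → 59.0 μs corresponds to the reported 56% (0.56x) figure.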

📝 Explanation and details

The optimization replaces a generic modulo-based loop with specialized, efficient list operations based on the relationship between input length and target length (19).

Key optimizations:

  1. Eliminated expensive loop and modulo operations: The original code used `for i in range(19)` with `xlabs_double_block_residuals[i % len(xlabs_double_block_residuals)]`, performing 19 iterations, each with a modulo calculation, a list index, and an append.

  2. Leveraged Python's efficient list multiplication: For the most common case where input length is 1, the optimization uses `xlabs_double_block_residuals * n` (with `n` equal to the target length, 19) to create 19 references to the same tensor in a single operation, eliminating the loop and all modulo calculations.

  3. Added fast-path for exact matches: When input length equals 19, it uses `xlabs_double_block_residuals[:]` (shallow copy) instead of cycling through the loop.

  4. Optimized general case with batch operations: For other lengths, it uses integer division to determine full repetitions (`xlabs_double_block_residuals * reps`) and handles remainders with slice operations (`xlabs_double_block_residuals[:rem]`), replacing the individual append operations with at most two list operations.
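
The four points above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function names `cycle_with_loop` and `cycle_with_list_ops` and the constant `TARGET_LEN` are hypothetical, and plain objects stand in for tensors so the block runs without torch.

```python
TARGET_LEN = 19  # assumed target number of double-block residuals


def cycle_with_loop(items, n=TARGET_LEN):
    """Loop-and-modulo approach, as described for the original code."""
    out = []
    if items is not None:
        for i in range(n):
            out.append(items[i % len(items)])
    return out


def cycle_with_list_ops(items, n=TARGET_LEN):
    """List multiplication and slicing, as described for the optimized code."""
    if items is None:
        return []
    length = len(items)
    if length == 1:
        # Fast path: n references to the same element in one operation.
        return items * n
    if length == n:
        # Fast path: shallow copy, order preserved.
        return items[:]
    # General case: full repetitions plus a partial slice for the remainder.
    # For length > n this yields reps == 0 and rem == n, i.e. items[:n].
    reps, rem = divmod(n, length)
    return items * reps + items[:rem]


# Both versions return the same objects in the same order.
for length in (1, 3, 5, 19, 25):
    items = [object() for _ in range(length)]
    looped = cycle_with_loop(items)
    batched = cycle_with_list_ops(items)
    assert len(batched) == TARGET_LEN
    assert all(a is b for a, b in zip(looped, batched))
```

Note that both versions store references to the input elements rather than copies, so repeating a single large tensor 19 times does not duplicate its memory.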

Performance impact by test case:

  • Single tensor repetition (most common): roughly 77-103% faster in the generated tests, due to eliminating the 19-iteration loop
  • Exact 19-tensor match: roughly 71-98% faster, using a shallow copy instead of cycling
  • Multiple tensor cycling: roughly 15-42% faster, through batch operations instead of individual appends

The optimization is particularly effective because it targets the bottleneck identified in the profiler: the loop (30.9% of time) and individual appends (41.5% of time). By replacing these with native Python list operations that are implemented in C, the function achieves a 56% overall speedup while maintaining identical functionality.
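
A micro-benchmark along these lines could be used to check the claim locally. It is a sketch under stated assumptions: the two functions are hypothetical stand-ins for the original and optimized logic (not the InvokeAI code), plain objects replace tensors, and absolute timings will differ by machine and Python version.

```python
import timeit

TARGET_LEN = 19  # assumed target number of double-block residuals


def cycle_with_loop(items, n=TARGET_LEN):
    """Loop-and-modulo approach, as described for the original code."""
    out = []
    for i in range(n):
        out.append(items[i % len(items)])
    return out


def cycle_with_list_ops(items, n=TARGET_LEN):
    """List multiplication and slicing, as described for the optimized code."""
    length = len(items)
    if length == 1:
        return items * n
    if length == n:
        return items[:]
    reps, rem = divmod(n, length)
    return items * reps + items[:rem]


def benchmark(number=20_000, repeat=5):
    """Return best-of-`repeat` seconds per call for each input length."""
    results = {}
    for length in (1, 3, 19):
        items = [object() for _ in range(length)]
        timings = {}
        for fn in (cycle_with_loop, cycle_with_list_ops):
            best = min(timeit.repeat(lambda: fn(items), number=number, repeat=repeat))
            timings[fn.__name__] = best / number
        results[length] = timings
    return results


if __name__ == "__main__":
    for length, timings in benchmark().items():
        loop_t = timings["cycle_with_loop"]
        ops_t = timings["cycle_with_list_ops"]
        print(f"len={length:>2}: loop {loop_t * 1e6:.2f} us, list ops {ops_t * 1e6:.2f} us")
```

The three input lengths correspond to the single-tensor, general, and exact-match paths described above.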

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 50 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest
import torch
from invokeai.backend.flux.extensions.xlabs_controlnet_extension import \
    XLabsControlNetExtension

# Minimal stubs for required classes and datatypes for unit testing

class ControlNetFluxOutput:
    def __init__(self, double_block_residuals, single_block_residuals):
        self.double_block_residuals = double_block_residuals
        self.single_block_residuals = single_block_residuals

class XLabsControlNetFluxOutput:
    def __init__(self, controlnet_double_block_residuals):
        self.controlnet_double_block_residuals = controlnet_double_block_residuals

class XLabsControlNetFlux:
    pass  # Not used in tested function

class BaseControlNetExtension:
    def __init__(self, weight, begin_step_percent, end_step_percent):
        self._weight = weight
        self._begin_step_percent = begin_step_percent
        self._end_step_percent = end_step_percent

# ---------------------- UNIT TESTS BELOW ----------------------

# Helper function to create dummy tensors
def make_tensor(shape, fill):
    return torch.full(shape, fill, dtype=torch.float32)

# Basic Test Cases

def test_basic_single_residual_repeats():
    """Test with a single tensor in the input list; should repeat 19 times."""
    tensor = make_tensor((2, 2), 1.0)
    xlabs_out = XLabsControlNetFluxOutput([tensor])
    ext = XLabsControlNetExtension(
        model=XLabsControlNetFlux(),
        controlnet_cond=make_tensor((1, 3, 8, 8), 0),
        weight=1.0,
        begin_step_percent=0.0,
        end_step_percent=1.0,
    )
    codeflash_output = ext._xlabs_output_to_controlnet_output(xlabs_out); result = codeflash_output # 3.55μs -> 1.81μs (96.0% faster)
    for t in result.double_block_residuals:
        pass

def test_basic_multiple_residuals_cycle():
    """Test with 3 tensors in the input list; should cycle through them 19 times."""
    tensors = [make_tensor((2, 2), i) for i in range(3)]
    xlabs_out = XLabsControlNetFluxOutput(tensors)
    ext = XLabsControlNetExtension(
        model=XLabsControlNetFlux(),
        controlnet_cond=make_tensor((1, 3, 8, 8), 0),
        weight=1.0,
        begin_step_percent=0.0,
        end_step_percent=1.0,
    )
    codeflash_output = ext._xlabs_output_to_controlnet_output(xlabs_out); result = codeflash_output # 3.40μs -> 2.63μs (29.4% faster)
    for i, t in enumerate(result.double_block_residuals):
        expected_tensor = tensors[i % 3]
        assert torch.equal(t, expected_tensor)  # output cycles through the 3 inputs

def test_basic_exactly_19_residuals():
    """Test with exactly 19 tensors in the input; output should match input order."""
    tensors = [make_tensor((2, 2), i) for i in range(19)]
    xlabs_out = XLabsControlNetFluxOutput(tensors)
    ext = XLabsControlNetExtension(
        model=XLabsControlNetFlux(),
        controlnet_cond=make_tensor((1, 3, 8, 8), 0),
        weight=1.0,
        begin_step_percent=0.0,
        end_step_percent=1.0,
    )
    codeflash_output = ext._xlabs_output_to_controlnet_output(xlabs_out); result = codeflash_output # 3.31μs -> 1.72μs (92.3% faster)
    for i, t in enumerate(result.double_block_residuals):
        pass

# Edge Test Cases


def test_edge_none_residuals():
    """Test with None as controlnet_double_block_residuals."""
    xlabs_out = XLabsControlNetFluxOutput(None)
    ext = XLabsControlNetExtension(
        model=XLabsControlNetFlux(),
        controlnet_cond=make_tensor((1, 3, 8, 8), 0),
        weight=1.0,
        begin_step_percent=0.0,
        end_step_percent=1.0,
    )
    codeflash_output = ext._xlabs_output_to_controlnet_output(xlabs_out); result = codeflash_output # 1.61μs -> 1.32μs (21.9% faster)

def test_edge_large_tensor_shapes():
    """Test with large tensor shapes but under 100MB total."""
    # Each tensor: (3, 128, 128) float32 = 3*128*128*4 = 196,608 bytes (~0.19MB)
    # 5 input tensors = ~0.94MB; the 19 outputs are references to them, not copies
    tensors = [make_tensor((3, 128, 128), i) for i in range(5)]
    xlabs_out = XLabsControlNetFluxOutput(tensors)
    ext = XLabsControlNetExtension(
        model=XLabsControlNetFlux(),
        controlnet_cond=make_tensor((1, 3, 128, 128), 0),
        weight=1.0,
        begin_step_percent=0.0,
        end_step_percent=1.0,
    )
    codeflash_output = ext._xlabs_output_to_controlnet_output(xlabs_out); result = codeflash_output # 4.09μs -> 3.48μs (17.5% faster)
    for i, t in enumerate(result.double_block_residuals):
        expected_tensor = tensors[i % 5]
        assert torch.equal(t, expected_tensor)  # output cycles through the 5 inputs

def test_edge_one_residual_large_tensor():
    """Test with a single large tensor, repeated 19 times."""
    # (3, 256, 256) ~0.75MB
    tensor = make_tensor((3, 256, 256), 42.0)
    xlabs_out = XLabsControlNetFluxOutput([tensor])
    ext = XLabsControlNetExtension(
        model=XLabsControlNetFlux(),
        controlnet_cond=make_tensor((1, 3, 256, 256), 0),
        weight=1.0,
        begin_step_percent=0.0,
        end_step_percent=1.0,
    )
    codeflash_output = ext._xlabs_output_to_controlnet_output(xlabs_out); result = codeflash_output # 4.32μs -> 2.12μs (103% faster)
    for t in result.double_block_residuals:
        pass

def test_edge_input_list_longer_than_19():
    """Test with input list longer than 19; only first 19 used in order."""
    tensors = [make_tensor((2, 2), i) for i in range(25)]
    xlabs_out = XLabsControlNetFluxOutput(tensors)
    ext = XLabsControlNetExtension(
        model=XLabsControlNetFlux(),
        controlnet_cond=make_tensor((1, 3, 8, 8), 0),
        weight=1.0,
        begin_step_percent=0.0,
        end_step_percent=1.0,
    )
    codeflash_output = ext._xlabs_output_to_controlnet_output(xlabs_out); result = codeflash_output # 3.41μs -> 2.45μs (39.2% faster)
    # Output should be the first 19 input tensors in order; with 25 inputs there is no wrap-around
    for i, t in enumerate(result.double_block_residuals):
        pass

# Large Scale Test Cases

def test_large_scale_many_unique_tensors():
    """Test with 100 unique tensors, cycling through them for 19 outputs."""
    tensors = [make_tensor((2, 2), i) for i in range(100)]
    xlabs_out = XLabsControlNetFluxOutput(tensors)
    ext = XLabsControlNetExtension(
        model=XLabsControlNetFlux(),
        controlnet_cond=make_tensor((1, 3, 8, 8), 0),
        weight=1.0,
        begin_step_percent=0.0,
        end_step_percent=1.0,
    )
    codeflash_output = ext._xlabs_output_to_controlnet_output(xlabs_out); result = codeflash_output # 3.30μs -> 2.32μs (42.4% faster)
    # Should cycle through the first 19 tensors (since 19 < 100, no repeats)
    for i, t in enumerate(result.double_block_residuals):
        pass

def test_large_scale_input_length_19():
    """Test with input list of length 19, each tensor unique and large."""
    tensors = [make_tensor((3, 128, 128), i) for i in range(19)]
    xlabs_out = XLabsControlNetFluxOutput(tensors)
    ext = XLabsControlNetExtension(
        model=XLabsControlNetFlux(),
        controlnet_cond=make_tensor((1, 3, 128, 128), 0),
        weight=1.0,
        begin_step_percent=0.0,
        end_step_percent=1.0,
    )
    codeflash_output = ext._xlabs_output_to_controlnet_output(xlabs_out); result = codeflash_output # 4.31μs -> 2.38μs (80.9% faster)
    for i, t in enumerate(result.double_block_residuals):
        pass

def test_large_scale_input_length_1_large_tensor():
    """Test with a single very large tensor (under 100MB), repeated 19 times."""
    # (3, 512, 512) = 3*512*512*4 = 3,145,728 bytes ~3MB
    tensor = make_tensor((3, 512, 512), 7.0)
    xlabs_out = XLabsControlNetFluxOutput([tensor])
    ext = XLabsControlNetExtension(
        model=XLabsControlNetFlux(),
        controlnet_cond=make_tensor((1, 3, 512, 512), 0),
        weight=1.0,
        begin_step_percent=0.0,
        end_step_percent=1.0,
    )
    codeflash_output = ext._xlabs_output_to_controlnet_output(xlabs_out); result = codeflash_output # 5.20μs -> 2.66μs (95.3% faster)
    for t in result.double_block_residuals:
        pass

def test_large_scale_input_length_19_large_tensors():
    """Test with 19 large tensors, each unique, for max memory usage under 100MB."""
    # Each tensor: (3, 256, 256) ~0.75MB, 19*0.75 = ~14MB
    tensors = [make_tensor((3, 256, 256), i) for i in range(19)]
    xlabs_out = XLabsControlNetFluxOutput(tensors)
    ext = XLabsControlNetExtension(
        model=XLabsControlNetFlux(),
        controlnet_cond=make_tensor((1, 3, 256, 256), 0),
        weight=1.0,
        begin_step_percent=0.0,
        end_step_percent=1.0,
    )
    codeflash_output = ext._xlabs_output_to_controlnet_output(xlabs_out); result = codeflash_output # 4.77μs -> 2.41μs (97.7% faster)
    for i, t in enumerate(result.double_block_residuals):
        pass

# Additional edge: test with non-float tensors
def test_edge_non_float_tensor():
    """Test with integer tensors (should still work, as torch.equal allows)."""
    tensor = torch.ones((2, 2), dtype=torch.int32)
    xlabs_out = XLabsControlNetFluxOutput([tensor])
    ext = XLabsControlNetExtension(
        model=XLabsControlNetFlux(),
        controlnet_cond=make_tensor((1, 3, 8, 8), 0),
        weight=1.0,
        begin_step_percent=0.0,
        end_step_percent=1.0,
    )
    codeflash_output = ext._xlabs_output_to_controlnet_output(xlabs_out); result = codeflash_output # 3.66μs -> 1.90μs (93.0% faster)
    for t in result.double_block_residuals:
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest  # used for our unit tests
import torch
from invokeai.backend.flux.extensions.xlabs_controlnet_extension import \
    XLabsControlNetExtension


# Minimal stubs for required classes for testing
class ControlNetFluxOutput:
    def __init__(self, double_block_residuals, single_block_residuals):
        self.double_block_residuals = double_block_residuals
        self.single_block_residuals = single_block_residuals

class XLabsControlNetFluxOutput:
    def __init__(self, controlnet_double_block_residuals):
        self.controlnet_double_block_residuals = controlnet_double_block_residuals

class BaseControlNetExtension:
    def __init__(self, weight, begin_step_percent, end_step_percent):
        self._weight = weight
        self._begin_step_percent = begin_step_percent
        self._end_step_percent = end_step_percent


# Helper function to create dummy tensors
def make_tensor(val, shape=(1,)):
    # Create a tensor filled with 'val' and of the specified shape
    return torch.full(shape, val)

# Fixtures for extension instance
@pytest.fixture
def extension():
    # Dummy model and cond, not used in tested function
    model = None
    cond = torch.zeros((1, 3, 32, 32))
    return XLabsControlNetExtension(model, cond, weight=1.0, begin_step_percent=0.0, end_step_percent=1.0)

# ---------------- BASIC TEST CASES ----------------

def test_basic_single_residual(extension):
    # Single tensor in input list
    input_tensor = make_tensor(5.0, shape=(2, 2))
    xlabs_output = XLabsControlNetFluxOutput([input_tensor])
    codeflash_output = extension._xlabs_output_to_controlnet_output(xlabs_output); output = codeflash_output # 4.27μs -> 2.42μs (76.6% faster)
    for t in output.double_block_residuals:
        pass

def test_basic_multiple_residuals(extension):
    # Multiple tensors in input list
    tensors = [make_tensor(i, shape=(2, 2)) for i in range(3)]
    xlabs_output = XLabsControlNetFluxOutput(tensors)
    codeflash_output = extension._xlabs_output_to_controlnet_output(xlabs_output); output = codeflash_output # 3.60μs -> 2.92μs (23.2% faster)
    for i, t in enumerate(output.double_block_residuals):
        expected_tensor = tensors[i % 3]
        assert torch.equal(t, expected_tensor)  # output cycles through the 3 inputs

def test_basic_none_residuals(extension):
    # None input should result in empty output list
    xlabs_output = XLabsControlNetFluxOutput(None)
    codeflash_output = extension._xlabs_output_to_controlnet_output(xlabs_output); output = codeflash_output # 1.43μs -> 1.29μs (10.8% faster)

# ---------------- EDGE TEST CASES ----------------


def test_edge_single_tensor_large_shape(extension):
    # Large tensor, but under 100MB (e.g., 100x100x3 floats = ~120KB)
    tensor = make_tensor(1.0, shape=(3, 100, 100))
    xlabs_output = XLabsControlNetFluxOutput([tensor])
    codeflash_output = extension._xlabs_output_to_controlnet_output(xlabs_output); output = codeflash_output # 4.28μs -> 2.21μs (93.3% faster)
    for t in output.double_block_residuals:
        pass

def test_edge_more_than_19_residuals(extension):
    # More than 19 tensors: only the first 19 should be used, in order
    tensors = [make_tensor(i, shape=(1,)) for i in range(25)]
    xlabs_output = XLabsControlNetFluxOutput(tensors)
    codeflash_output = extension._xlabs_output_to_controlnet_output(xlabs_output); output = codeflash_output # 3.50μs -> 2.62μs (33.7% faster)
    for i, t in enumerate(output.double_block_residuals):
        pass

def test_edge_zero_length(extension):
    # Zero-length tensor in list
    tensor = make_tensor(0.0, shape=(0,))
    xlabs_output = XLabsControlNetFluxOutput([tensor])
    codeflash_output = extension._xlabs_output_to_controlnet_output(xlabs_output); output = codeflash_output # 3.64μs -> 1.87μs (94.1% faster)
    for t in output.double_block_residuals:
        pass

def test_edge_different_shapes(extension):
    # Tensors of different shapes: should cycle as-is
    tensors = [make_tensor(i, shape=(i+1,)) for i in range(3)]
    xlabs_output = XLabsControlNetFluxOutput(tensors)
    codeflash_output = extension._xlabs_output_to_controlnet_output(xlabs_output); output = codeflash_output # 3.41μs -> 2.75μs (24.3% faster)
    for i, t in enumerate(output.double_block_residuals):
        expected_tensor = tensors[i % 3]
        assert torch.equal(t, expected_tensor)  # shapes differ per input but cycling is unchanged

# ---------------- LARGE SCALE TEST CASES ----------------

def test_large_scale_999_residuals(extension):
    # 999 tensors, should cycle through them for 19 outputs
    tensors = [make_tensor(i, shape=(1,)) for i in range(999)]
    xlabs_output = XLabsControlNetFluxOutput(tensors)
    codeflash_output = extension._xlabs_output_to_controlnet_output(xlabs_output); output = codeflash_output # 4.26μs -> 3.17μs (34.4% faster)
    for i, t in enumerate(output.double_block_residuals):
        pass

def test_large_scale_large_tensor(extension):
    # Large tensor, well under 100MB: 3x1024x1024 float32 = ~12MB, safe for test
    tensor = make_tensor(2.0, shape=(3, 1024, 1024))
    xlabs_output = XLabsControlNetFluxOutput([tensor])
    codeflash_output = extension._xlabs_output_to_controlnet_output(xlabs_output); output = codeflash_output # 5.42μs -> 2.83μs (92.0% faster)
    for t in output.double_block_residuals:
        pass

def test_large_scale_many_inputs_and_shapes(extension):
    # 19 tensors, each a different shape, should cycle through exactly once
    tensors = [make_tensor(i, shape=(i+1,)) for i in range(19)]
    xlabs_output = XLabsControlNetFluxOutput(tensors)
    codeflash_output = extension._xlabs_output_to_controlnet_output(xlabs_output); output = codeflash_output # 3.38μs -> 1.98μs (70.9% faster)
    for i, t in enumerate(output.double_block_residuals):
        pass

def test_large_scale_batch_tensors(extension):
    # Tensors with batch dimension, e.g., (8, 3, 32, 32)
    tensors = [make_tensor(i, shape=(8, 3, 32, 32)) for i in range(5)]
    xlabs_output = XLabsControlNetFluxOutput(tensors)
    codeflash_output = extension._xlabs_output_to_controlnet_output(xlabs_output); output = codeflash_output # 3.89μs -> 3.40μs (14.6% faster)
    for i, t in enumerate(output.double_block_residuals):
        pass

# ---------------- MISCELLANEOUS TESTS ----------------

def test_input_is_not_list_or_none(extension):
    # If input is not a list or None, should raise TypeError when trying to iterate
    xlabs_output = XLabsControlNetFluxOutput(123)
    with pytest.raises(TypeError):
        extension._xlabs_output_to_controlnet_output(xlabs_output) # 2.29μs -> 1.45μs (57.9% faster)

def test_input_list_contains_non_tensor(extension):
    # The function does not validate element types, so a non-tensor entry is cycled through unchanged
    xlabs_output = XLabsControlNetFluxOutput([make_tensor(1), "not a tensor"])
    codeflash_output = extension._xlabs_output_to_controlnet_output(xlabs_output); output = codeflash_output # 3.85μs -> 2.90μs (32.9% faster)
    # Even indices hold the tensor and odd indices hold the string, so torch.equal only applies to even indices
    for i, t in enumerate(output.double_block_residuals):
        if i % 2 == 0:
            pass
        else:
            pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, git checkout codeflash/optimize-XLabsControlNetExtension._xlabs_output_to_controlnet_output-mhncttva and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 6, 2025 11:39
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: Medium (Optimization Quality according to Codeflash) labels Nov 6, 2025