
Conversation


@codeflash-ai codeflash-ai bot commented Nov 7, 2025

📄 8% (0.08x) speedup for SchedulerOutputProcessorMixin.hacky_process_eagle_overlap_result in python/sglang/srt/managers/scheduler_output_processor_mixin.py

⏱️ Runtime: 141 microseconds → 131 microseconds (best of 119 runs)

📝 Explanation and details

The optimization achieves a 7% speedup through three key micro-optimizations that reduce overhead in the inner loop:

What was optimized (a sketch of the pattern follows the list):

  1. Pre-allocated list with fixed size: Changed from predict_tokens = [] with repeated .append() calls to predict_tokens = [None] * len(batch_reqs) with direct indexing assignment
  2. Cached attribute lookup: Stored batch.reqs in local variable batch_reqs to avoid repeated attribute access in the loop
  3. Optimized slice calculation: Used explicit start and end variables instead of computing the slice indices inline
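
A minimal sketch of the before/after pattern described above; the identifiers echo the ones named in the bullets (batch.reqs, predict_tokens, start/end), but the surrounding function bodies are illustrative stand-ins, not the actual source of hacky_process_eagle_overlap_result:

# Before (illustrative): dynamic list growth, repeated attribute lookups,
# and inline slice arithmetic inside the hot loop.
def gather_predict_tokens_before(batch, accept_lens, next_token_ids, num_draft_tokens):
    predict_tokens = []
    for i in range(len(batch.reqs)):  # batch.reqs re-resolved on every iteration
        predict_tokens.append(
            next_token_ids[i * num_draft_tokens : i * num_draft_tokens + accept_lens[i]]
        )
    return predict_tokens

# After (illustrative): pre-allocated result list, cached batch.reqs,
# and explicit start/end slice indices.
def gather_predict_tokens_after(batch, accept_lens, next_token_ids, num_draft_tokens):
    batch_reqs = batch.reqs                     # cached attribute lookup
    predict_tokens = [None] * len(batch_reqs)   # pre-allocated, filled by index
    for i in range(len(batch_reqs)):
        start = i * num_draft_tokens
        end = start + accept_lens[i]
        predict_tokens[i] = next_token_ids[start:end]
    return predict_tokens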

Why this leads to speedup:

  • List pre-allocation eliminates the overhead of dynamic list growth and repeated append() method calls, which is especially beneficial for larger batches
  • Cached attribute access removes the repeated batch.reqs attribute lookup in each iteration, reducing Python's attribute resolution overhead
  • Local variable calculations for slice indices are slightly faster than inline arithmetic expressions in Python

Performance characteristics based on test results:

  • Small batches (1-2 requests): Shows minimal improvement or slight regression due to pre-allocation overhead
  • Large batches (100+ requests): Shows significant gains, running 11-13% faster, where the optimizations really pay off
  • Edge cases: Generally neutral performance impact, maintaining robustness

This optimization is particularly valuable for high-throughput scenarios with larger batch sizes, which are typical of production LLM serving workloads where this speculative decoding logic runs frequently.
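
For readers who want to sanity-check the list pre-allocation claim in isolation, here is a rough, standalone timeit sketch (not taken from the PR; the batch size and iteration counts are arbitrary, and absolute numbers will vary by machine):

import timeit

N = 100  # roughly the "large batch" size exercised by the tests below

def with_append():
    out = []
    for i in range(N):
        out.append(i)
    return out

def with_prealloc():
    out = [None] * N
    for i in range(N):
        out[i] = i
    return out

# Pre-allocating and assigning by index is usually slightly faster than
# growing the list with append() in a tight loop.
print("append   :", timeit.timeit(with_append, number=100_000))
print("prealloc :", timeit.timeit(with_prealloc, number=100_000))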

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 26 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
🌀 Generated Regression Tests and Runtime
import pytest
from sglang.srt.managers.scheduler_output_processor_mixin import \
    SchedulerOutputProcessorMixin


# function to test
class DummyDraftWorker:
    def __init__(self, speculative_num_draft_tokens):
        self.speculative_num_draft_tokens = speculative_num_draft_tokens

class DummyReq:
    def __init__(self):
        self.spec_verify_ct = 0

class DummyScheduler:
    def __init__(self, speculative_num_draft_tokens):
        self.draft_worker = DummyDraftWorker(speculative_num_draft_tokens)

class DummyGenerationBatchResult:
    def __init__(self, last_batch_allocate_lens, accept_lens, next_token_ids):
        # Simulate .tolist() for each attribute
        self.last_batch_allocate_lens = last_batch_allocate_lens
        self.accept_lens = accept_lens
        self.next_token_ids = next_token_ids

class DummyScheduleBatch:
    def __init__(self, reqs):
        self.reqs = reqs


# Helper class to simulate .tolist() on lists
class ListWithToList(list):
    def tolist(self):
        return list(self)
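
# NOTE: in production these result fields are presumably tensor-like objects
# on which hacky_process_eagle_overlap_result calls .tolist(); this list
# subclass is a minimal stand-in that provides the same method. The tensor
# framing is an assumption based on how the dummies are used here, not on the
# real GenerationBatchResult types.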

# ----------------- UNIT TESTS -----------------
# Basic Test Cases
def test_basic_single_request():
    """
    Test with one request, one draft token, accept one token.
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=1)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    result = DummyGenerationBatchResult(
        ListWithToList([1]),  # last_batch_allocate_lens
        ListWithToList([1]),  # accept_lens
        ListWithToList([42])  # next_token_ids
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.46μs -> 2.56μs (3.75% slower)

def test_basic_multiple_requests():
    """
    Test with two requests, two draft tokens per request, different accept_lens.
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req1 = DummyReq()
    req2 = DummyReq()
    batch = DummyScheduleBatch([req1, req2])
    result = DummyGenerationBatchResult(
        ListWithToList([2, 2]),
        ListWithToList([2, 1]),
        ListWithToList([10, 11, 20, 21])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.47μs -> 2.78μs (11.3% slower)

def test_basic_zero_accept_lens():
    """
    Test with a request that accepts zero tokens.
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    result = DummyGenerationBatchResult(
        ListWithToList([2]),
        ListWithToList([0]),
        ListWithToList([1, 2])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.10μs -> 2.40μs (12.8% slower)

# Edge Test Cases
def test_edge_empty_batch():
    """
    Test with an empty batch (no requests).
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    batch = DummyScheduleBatch([])
    result = DummyGenerationBatchResult(
        ListWithToList([]),
        ListWithToList([]),
        ListWithToList([])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 1.37μs -> 1.64μs (16.7% slower)

def test_edge_accept_lens_greater_than_draft_tokens():
    """
    Test with accept_lens greater than speculative_num_draft_tokens (should not happen, but test robustness).
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    result = DummyGenerationBatchResult(
        ListWithToList([2]),
        ListWithToList([3]),  # accept_lens > num_draft_tokens
        ListWithToList([5, 6, 7, 8])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.09μs -> 2.25μs (7.23% slower)

def test_edge_negative_accept_lens():
    """
    Test with negative accept_lens (should not happen, but check for robustness).
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    result = DummyGenerationBatchResult(
        ListWithToList([2]),
        ListWithToList([-1]),
        ListWithToList([9, 10])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.03μs -> 2.25μs (9.90% slower)

def test_edge_large_accept_lens_and_tokens():
    """
    Test with accept_lens and next_token_ids at their maximum allowed (but <1000).
    """
    N = 999
    scheduler = DummyScheduler(speculative_num_draft_tokens=N)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    tokens = list(range(N))
    result = DummyGenerationBatchResult(
        ListWithToList([N]),
        ListWithToList([N]),
        ListWithToList(tokens)
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 5.80μs -> 6.04μs (3.96% slower)

# Large Scale Test Cases

def test_large_scale_varied_accept_lens():
    """
    Test with 100 requests, 10 draft tokens, accept_lens varies from 0 to 9.
    """
    num_reqs = 100
    num_draft_tokens = 10
    scheduler = DummyScheduler(speculative_num_draft_tokens=num_draft_tokens)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    reqs = [DummyReq() for _ in range(num_reqs)]
    batch = DummyScheduleBatch(reqs)
    last_batch_allocate_lens = ListWithToList([num_draft_tokens] * num_reqs)
    accept_lens = ListWithToList([i % num_draft_tokens for i in range(num_reqs)])
    next_token_ids = []
    for i in range(num_reqs):
        next_token_ids.extend([i*num_draft_tokens + j for j in range(num_draft_tokens)])
    next_token_ids = ListWithToList(next_token_ids)

    result = DummyGenerationBatchResult(
        last_batch_allocate_lens,
        accept_lens,
        next_token_ids
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 20.6μs -> 18.4μs (11.9% faster)
    for i in range(num_reqs):
        expected = [i*num_draft_tokens + j for j in range(accept_lens[i])]

def test_large_scale_zero_accept_lens():
    """
    Test with 100 requests, each with 10 draft tokens, all accept_lens=0.
    """
    num_reqs = 100
    num_draft_tokens = 10
    scheduler = DummyScheduler(speculative_num_draft_tokens=num_draft_tokens)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    reqs = [DummyReq() for _ in range(num_reqs)]
    batch = DummyScheduleBatch(reqs)
    last_batch_allocate_lens = ListWithToList([num_draft_tokens] * num_reqs)
    accept_lens = ListWithToList([0] * num_reqs)
    next_token_ids = ListWithToList([i for i in range(num_reqs * num_draft_tokens)])

    result = DummyGenerationBatchResult(
        last_batch_allocate_lens,
        accept_lens,
        next_token_ids
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 19.1μs -> 16.8μs (13.2% faster)
    for i in range(num_reqs):
        pass

# Edge case: mismatched lengths
def test_edge_mismatched_lengths():
    """
    Test with mismatched lengths of accept_lens and next_token_ids.
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=3)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    # Only 2 tokens, but 3 draft tokens specified
    result = DummyGenerationBatchResult(
        ListWithToList([3]),
        ListWithToList([2]),
        ListWithToList([100, 101])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.00μs -> 2.26μs (11.5% slower)

# Edge case: request object with pre-existing spec_verify_ct
def test_edge_request_with_existing_spec_verify_ct():
    """
    Test that spec_verify_ct increments from a nonzero value.
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    req.spec_verify_ct = 5
    batch = DummyScheduleBatch([req])
    result = DummyGenerationBatchResult(
        ListWithToList([2]),
        ListWithToList([2]),
        ListWithToList([200, 201])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 1.91μs -> 2.17μs (12.2% slower)

# Edge case: negative draft tokens
def test_edge_negative_draft_tokens():
    """
    Test with negative speculative_num_draft_tokens.
    Should not crash, but will not return any tokens.
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=-2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    result = DummyGenerationBatchResult(
        ListWithToList([2]),
        ListWithToList([2]),
        ListWithToList([1, 2])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.02μs -> 2.11μs (4.21% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from __future__ import annotations

# imports
import pytest
from sglang.srt.managers.scheduler_output_processor_mixin import \
    SchedulerOutputProcessorMixin


class DummyDraftWorker:
    def __init__(self, speculative_num_draft_tokens):
        self.speculative_num_draft_tokens = speculative_num_draft_tokens

class DummyReq:
    def __init__(self):
        self.spec_verify_ct = 0

class DummyScheduler:
    def __init__(self, speculative_num_draft_tokens):
        self.draft_worker = DummyDraftWorker(speculative_num_draft_tokens)

class DummyBatch:
    def __init__(self, reqs):
        self.reqs = reqs

class DummyResult:
    def __init__(self, last_batch_allocate_lens, accept_lens, next_token_ids):
        # Simulate .tolist() for each attribute
        self.last_batch_allocate_lens = DummyList(last_batch_allocate_lens)
        self.accept_lens = DummyList(accept_lens)
        self.next_token_ids = DummyList(next_token_ids)

class DummyList(list):
    def tolist(self):
        return list(self)

# --- End function to test ---

# --- Begin unit tests ---

# Helper function to build inputs
def build_inputs(
    num_reqs,
    speculative_num_draft_tokens,
    last_batch_allocate_lens,
    accept_lens,
    next_token_ids,
):
    scheduler = DummyScheduler(speculative_num_draft_tokens)
    reqs = [DummyReq() for _ in range(num_reqs)]
    batch = DummyBatch(reqs)
    result = DummyResult(last_batch_allocate_lens, accept_lens, next_token_ids)
    return scheduler, result, batch, reqs
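
# Illustrative call, not part of the generated tests; the values mirror the
# earlier two-request case:
#   scheduler, result, batch, reqs = build_inputs(
#       num_reqs=2,
#       speculative_num_draft_tokens=2,
#       last_batch_allocate_lens=[2, 2],
#       accept_lens=[2, 1],
#       next_token_ids=[10, 11, 20, 21],
#   )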

# 1. Basic Test Cases

To edit these changes, git checkout codeflash/optimize-SchedulerOutputProcessorMixin.hacky_process_eagle_overlap_result-mhot7lqo and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 7, 2025 12:06
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: Medium Optimization Quality according to Codeflash labels Nov 7, 2025