
Conversation


@codeflash-ai codeflash-ai bot commented Nov 7, 2025

📄 8% (0.08x) speedup for SchedulerOutputProcessorMixin.hacky_process_eagle_overlap_result in python/sglang/srt/managers/scheduler_output_processor_mixin.py

⏱️ Runtime: 141 microseconds → 131 microseconds (best of 119 runs)

📝 Explanation and details

The optimization achieves a 7% speedup through three key micro-optimizations that reduce overhead in the inner loop:

What was optimized (a sketch of the pattern follows the list):

  1. Pre-allocated list with fixed size: Changed from predict_tokens = [] with repeated .append() calls to predict_tokens = [None] * len(batch_reqs) with direct indexing assignment
  2. Cached attribute lookup: Stored batch.reqs in local variable batch_reqs to avoid repeated attribute access in the loop
  3. Optimized slice calculation: Used explicit start and end variables instead of computing the slice indices inline
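
A minimal sketch of the before/after pattern described above; the identifiers echo the ones named in the bullets (batch.reqs, predict_tokens, start/end), but the surrounding function bodies are illustrative stand-ins, not the actual source of hacky_process_eagle_overlap_result:

# Before (illustrative): dynamic list growth, repeated attribute lookups,
# and inline slice arithmetic inside the hot loop.
def gather_predict_tokens_before(batch, accept_lens, next_token_ids, num_draft_tokens):
    predict_tokens = []
    for i in range(len(batch.reqs)):  # batch.reqs re-resolved on every iteration
        predict_tokens.append(
            next_token_ids[i * num_draft_tokens : i * num_draft_tokens + accept_lens[i]]
        )
    return predict_tokens

# After (illustrative): pre-allocated result list, cached batch.reqs,
# and explicit start/end slice indices.
def gather_predict_tokens_after(batch, accept_lens, next_token_ids, num_draft_tokens):
    batch_reqs = batch.reqs                     # cached attribute lookup
    predict_tokens = [None] * len(batch_reqs)   # pre-allocated, filled by index
    for i in range(len(batch_reqs)):
        start = i * num_draft_tokens
        end = start + accept_lens[i]
        predict_tokens[i] = next_token_ids[start:end]
    return predict_tokens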

Why this leads to speedup:

  • List pre-allocation eliminates the overhead of dynamic list growth and repeated append() method calls, which is especially beneficial for larger batches
  • Cached attribute access removes the repeated batch.reqs attribute lookup in each iteration, reducing Python's attribute resolution overhead
  • Local variable calculations for slice indices are slightly faster than inline arithmetic expressions in Python

Performance characteristics based on test results:

  • Small batches (1-2 requests): Shows minimal improvement or slight regression due to pre-allocation overhead
  • Large batches (100+ requests): Shows significant gains, running 11-13% faster, where the optimizations really pay off
  • Edge cases: Generally neutral performance impact, maintaining robustness

This optimization is particularly valuable for high-throughput scenarios with larger batch sizes, which are typical of production LLM serving workloads where this speculative decoding logic runs frequently.
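
For readers who want to sanity-check the list pre-allocation claim in isolation, here is a rough, standalone timeit sketch (not taken from the PR; the batch size and iteration counts are arbitrary, and absolute numbers will vary by machine):

import timeit

N = 100  # roughly the "large batch" size exercised by the tests below

def with_append():
    out = []
    for i in range(N):
        out.append(i)
    return out

def with_prealloc():
    out = [None] * N
    for i in range(N):
        out[i] = i
    return out

# Pre-allocating and assigning by index is usually slightly faster than
# growing the list with append() in a tight loop.
print("append   :", timeit.timeit(with_append, number=100_000))
print("prealloc :", timeit.timeit(with_prealloc, number=100_000))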

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 26 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
🌀 Generated Regression Tests and Runtime
import pytest
from sglang.srt.managers.scheduler_output_processor_mixin import \
    SchedulerOutputProcessorMixin


# function to test
class DummyDraftWorker:
    def __init__(self, speculative_num_draft_tokens):
        self.speculative_num_draft_tokens = speculative_num_draft_tokens

class DummyReq:
    def __init__(self):
        self.spec_verify_ct = 0

class DummyScheduler:
    def __init__(self, speculative_num_draft_tokens):
        self.draft_worker = DummyDraftWorker(speculative_num_draft_tokens)

class DummyGenerationBatchResult:
    def __init__(self, last_batch_allocate_lens, accept_lens, next_token_ids):
        # Simulate .tolist() for each attribute
        self.last_batch_allocate_lens = last_batch_allocate_lens
        self.accept_lens = accept_lens
        self.next_token_ids = next_token_ids

class DummyScheduleBatch:
    def __init__(self, reqs):
        self.reqs = reqs


# Helper class to simulate .tolist() on lists
class ListWithToList(list):
    def tolist(self):
        return list(self)
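
# NOTE: in production these result fields are presumably tensor-like objects
# on which hacky_process_eagle_overlap_result calls .tolist(); this list
# subclass is a minimal stand-in that provides the same method. The tensor
# framing is an assumption based on how the dummies are used here, not on the
# real GenerationBatchResult types.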

# ----------------- UNIT TESTS -----------------
# Basic Test Cases
def test_basic_single_request():
    """
    Test with one request, one draft token, accept one token.
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=1)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    result = DummyGenerationBatchResult(
        ListWithToList([1]),  # last_batch_allocate_lens
        ListWithToList([1]),  # accept_lens
        ListWithToList([42])  # next_token_ids
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.46μs -> 2.56μs (3.75% slower)

def test_basic_multiple_requests():
    """
    Test with two requests, two draft tokens per request, different accept_lens.
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req1 = DummyReq()
    req2 = DummyReq()
    batch = DummyScheduleBatch([req1, req2])
    result = DummyGenerationBatchResult(
        ListWithToList([2, 2]),
        ListWithToList([2, 1]),
        ListWithToList([10, 11, 20, 21])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.47μs -> 2.78μs (11.3% slower)

def test_basic_zero_accept_lens():
    """
    Test with a request that accepts zero tokens.
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    result = DummyGenerationBatchResult(
        ListWithToList([2]),
        ListWithToList([0]),
        ListWithToList([1, 2])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.10μs -> 2.40μs (12.8% slower)

# Edge Test Cases
def test_edge_empty_batch():
    """
    Test with an empty batch (no requests).
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    batch = DummyScheduleBatch([])
    result = DummyGenerationBatchResult(
        ListWithToList([]),
        ListWithToList([]),
        ListWithToList([])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 1.37μs -> 1.64μs (16.7% slower)

def test_edge_accept_lens_greater_than_draft_tokens():
    """
    Test with accept_lens greater than speculative_num_draft_tokens (should not happen, but test robustness).
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    result = DummyGenerationBatchResult(
        ListWithToList([2]),
        ListWithToList([3]),  # accept_lens > num_draft_tokens
        ListWithToList([5, 6, 7, 8])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.09μs -> 2.25μs (7.23% slower)

def test_edge_negative_accept_lens():
    """
    Test with negative accept_lens (should not happen, but check for robustness).
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    result = DummyGenerationBatchResult(
        ListWithToList([2]),
        ListWithToList([-1]),
        ListWithToList([9, 10])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.03μs -> 2.25μs (9.90% slower)

def test_edge_large_accept_lens_and_tokens():
    """
    Test with accept_lens and next_token_ids at their maximum allowed (but <1000).
    """
    N = 999
    scheduler = DummyScheduler(speculative_num_draft_tokens=N)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    tokens = list(range(N))
    result = DummyGenerationBatchResult(
        ListWithToList([N]),
        ListWithToList([N]),
        ListWithToList(tokens)
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 5.80μs -> 6.04μs (3.96% slower)

# Large Scale Test Cases

def test_large_scale_varied_accept_lens():
    """
    Test with 100 requests, 10 draft tokens, accept_lens varies from 0 to 9.
    """
    num_reqs = 100
    num_draft_tokens = 10
    scheduler = DummyScheduler(speculative_num_draft_tokens=num_draft_tokens)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    reqs = [DummyReq() for _ in range(num_reqs)]
    batch = DummyScheduleBatch(reqs)
    last_batch_allocate_lens = ListWithToList([num_draft_tokens] * num_reqs)
    accept_lens = ListWithToList([i % num_draft_tokens for i in range(num_reqs)])
    next_token_ids = []
    for i in range(num_reqs):
        next_token_ids.extend([i*num_draft_tokens + j for j in range(num_draft_tokens)])
    next_token_ids = ListWithToList(next_token_ids)

    result = DummyGenerationBatchResult(
        last_batch_allocate_lens,
        accept_lens,
        next_token_ids
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 20.6μs -> 18.4μs (11.9% faster)
    for i in range(num_reqs):
        expected = [i*num_draft_tokens + j for j in range(accept_lens[i])]

def test_large_scale_zero_accept_lens():
    """
    Test with 100 requests, each with 10 draft tokens, all accept_lens=0.
    """
    num_reqs = 100
    num_draft_tokens = 10
    scheduler = DummyScheduler(speculative_num_draft_tokens=num_draft_tokens)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    reqs = [DummyReq() for _ in range(num_reqs)]
    batch = DummyScheduleBatch(reqs)
    last_batch_allocate_lens = ListWithToList([num_draft_tokens] * num_reqs)
    accept_lens = ListWithToList([0] * num_reqs)
    next_token_ids = ListWithToList([i for i in range(num_reqs * num_draft_tokens)])

    result = DummyGenerationBatchResult(
        last_batch_allocate_lens,
        accept_lens,
        next_token_ids
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 19.1μs -> 16.8μs (13.2% faster)
    for i in range(num_reqs):
        pass

# Edge case: mismatched lengths
def test_edge_mismatched_lengths():
    """
    Test with mismatched lengths of accept_lens and next_token_ids.
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=3)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    # Only 2 tokens, but 3 draft tokens specified
    result = DummyGenerationBatchResult(
        ListWithToList([3]),
        ListWithToList([2]),
        ListWithToList([100, 101])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.00μs -> 2.26μs (11.5% slower)

# Edge case: request object with pre-existing spec_verify_ct
def test_edge_request_with_existing_spec_verify_ct():
    """
    Test that spec_verify_ct increments from a nonzero value.
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    req.spec_verify_ct = 5
    batch = DummyScheduleBatch([req])
    result = DummyGenerationBatchResult(
        ListWithToList([2]),
        ListWithToList([2]),
        ListWithToList([200, 201])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 1.91μs -> 2.17μs (12.2% slower)

# Edge case: negative draft tokens
def test_edge_negative_draft_tokens():
    """
    Test with negative speculative_num_draft_tokens.
    Should not crash, but will not return any tokens.
    """
    scheduler = DummyScheduler(speculative_num_draft_tokens=-2)
    mixin = SchedulerOutputProcessorMixin()
    mixin.draft_worker = scheduler.draft_worker

    req = DummyReq()
    batch = DummyScheduleBatch([req])
    result = DummyGenerationBatchResult(
        ListWithToList([2]),
        ListWithToList([2]),
        ListWithToList([1, 2])
    )
    codeflash_output = mixin.hacky_process_eagle_overlap_result(result, batch); out = codeflash_output # 2.02μs -> 2.11μs (4.21% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from __future__ import annotations

# imports
import pytest
from sglang.srt.managers.scheduler_output_processor_mixin import \
    SchedulerOutputProcessorMixin


class DummyDraftWorker:
    def __init__(self, speculative_num_draft_tokens):
        self.speculative_num_draft_tokens = speculative_num_draft_tokens

class DummyReq:
    def __init__(self):
        self.spec_verify_ct = 0

class DummyScheduler:
    def __init__(self, speculative_num_draft_tokens):
        self.draft_worker = DummyDraftWorker(speculative_num_draft_tokens)

class DummyBatch:
    def __init__(self, reqs):
        self.reqs = reqs

class DummyResult:
    def __init__(self, last_batch_allocate_lens, accept_lens, next_token_ids):
        # Simulate .tolist() for each attribute
        self.last_batch_allocate_lens = DummyList(last_batch_allocate_lens)
        self.accept_lens = DummyList(accept_lens)
        self.next_token_ids = DummyList(next_token_ids)

class DummyList(list):
    def tolist(self):
        return list(self)

# --- End function to test ---

# --- Begin unit tests ---

# Helper function to build inputs
def build_inputs(
    num_reqs,
    speculative_num_draft_tokens,
    last_batch_allocate_lens,
    accept_lens,
    next_token_ids,
):
    scheduler = DummyScheduler(speculative_num_draft_tokens)
    reqs = [DummyReq() for _ in range(num_reqs)]
    batch = DummyBatch(reqs)
    result = DummyResult(last_batch_allocate_lens, accept_lens, next_token_ids)
    return scheduler, result, batch, reqs
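
# Illustrative call, not part of the generated tests; the values mirror the
# earlier two-request case:
#   scheduler, result, batch, reqs = build_inputs(
#       num_reqs=2,
#       speculative_num_draft_tokens=2,
#       last_batch_allocate_lens=[2, 2],
#       accept_lens=[2, 1],
#       next_token_ids=[10, 11, 20, 21],
#   )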

# 1. Basic Test Cases

To edit these changes, git checkout codeflash/optimize-SchedulerOutputProcessorMixin.hacky_process_eagle_overlap_result-mhot7lqo and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 7, 2025 12:06
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: Medium Optimization Quality according to Codeflash labels Nov 7, 2025