
Conversation


@codeflash-ai codeflash-ai bot commented Nov 7, 2025

📄 50% (0.50x) speedup for prefix_hold in python/sglang/srt/parser/harmony_parser.py

⏱️ Runtime : 2.65 milliseconds → 1.77 milliseconds (best of 250 runs)

📝 Explanation and details

The optimized version achieves a 50% speedup by restructuring the algorithm to eliminate redundant work and reduce total iterations.

Key optimizations:

  1. Pre-filtering and caching: Filters out empty tokens once upfront (filtered_tokens = [tok for tok in tokens if tok]) and calculates the maximum token length once, avoiding repeated empty token checks in inner loops.

  2. Reversed iteration strategy: Instead of iterating through each token and checking all possible suffix lengths, it iterates through suffix lengths in decreasing order (longest first) and stops immediately when a match is found. This exploits the fact that we only need the longest matching suffix.

  3. Early termination: When a match is found for a given suffix length k, the search immediately breaks out of both the token loop and the outer k loop, avoiding unnecessary comparisons (see the sketch after this list).
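
Taken together, the restructured loop looks roughly like the following minimal sketch. This is an illustration of the strategy rather than the actual code in harmony_parser.py, and it assumes prefix_hold returns an (emit, hold) pair of strings:

```python
from typing import List, Tuple


def prefix_hold_sketch(text: str, tokens: List[str]) -> Tuple[str, str]:
    """Illustrative sketch only -- not the actual implementation."""
    if not text:
        return "", ""

    # (1) Pre-filtering and caching: drop empty tokens once and bound the
    #     longest suffix worth testing.
    filtered_tokens = [tok for tok in tokens if tok]
    if not filtered_tokens:
        return text, ""
    max_hold = min(len(text), max(len(tok) for tok in filtered_tokens))

    # (2) Reversed iteration: try suffix lengths from longest to shortest.
    for k in range(max_hold, 0, -1):
        suffix = text[-k:]
        # (3) Early termination: the first (i.e. longest) match wins, so we
        #     stop scanning as soon as any token starts with this suffix.
        if any(tok.startswith(suffix) for tok in filtered_tokens):
            return text[:-k], suffix

    # No suffix of `text` begins any token: emit everything, hold nothing.
    return text, ""
```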

Performance impact by workload:

  • Large token lists with few matches (like test_large_many_tokens): Shows dramatic improvements (200%+ speedup) because the algorithm can quickly skip through many tokens once it finds the optimal suffix length
  • Small inputs or no matches: Shows modest slowdowns (15-45%) due to the overhead of pre-processing, but these cases are typically fast enough that the absolute time difference is negligible
  • Empty or mostly empty token lists: Benefits from avoiding repeated empty checks

The function is called in parsing hot paths where it processes text chunks and determines how much to emit versus hold for potential token completion. The optimization particularly benefits scenarios with many guard tokens (like self.guard_tokens in the parser), making text streaming more efficient in production parsing workflows.
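
For a concrete picture of that emit/hold split, here is a hypothetical usage example. It assumes the function returns an (emit, hold) pair, and the guard tokens shown are made up for illustration:

```python
from sglang.srt.parser.harmony_parser import prefix_hold

# Hypothetical guard tokens for illustration; the real set comes from the
# parser instance (its guard tokens), not from this list.
guard_tokens = ["<|channel|>", "<|message|>"]

# A streamed chunk ends with what might be the start of a guard token.
emit, hold = prefix_hold("Some analysis text <|chan", guard_tokens)
# Expected under the assumptions above:
#   emit == "Some analysis text "  (safe to stream now)
#   hold == "<|chan"               (kept back until the next chunk resolves it)
```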

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 60 Passed |
| 🌀 Generated Regression Tests | 79 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 3 Passed |
| 📊 Tests Coverage | 100.0% |
⚙️ Existing Unit Tests and Runtime
🌀 Generated Regression Tests and Runtime
from typing import List, Tuple

# imports
import pytest  # used for our unit tests
from sglang.srt.parser.harmony_parser import prefix_hold

# unit tests

# -------- BASIC TEST CASES --------

def test_empty_text_and_tokens():
    # Both text and tokens are empty
    codeflash_output = prefix_hold("", []) # 356ns -> 345ns (3.19% faster)

def test_empty_text_nonempty_tokens():
    # Text is empty, tokens are not
    codeflash_output = prefix_hold("", ["a", "b"]) # 334ns -> 303ns (10.2% faster)

def test_nonempty_text_empty_tokens():
    # Text is not empty, tokens are empty
    codeflash_output = prefix_hold("hello", []) # 536ns -> 2.11μs (74.5% slower)

def test_no_overlap():
    # No suffix of text is a prefix of any token
    codeflash_output = prefix_hold("hello", ["world", "test"]) # 3.69μs -> 4.72μs (21.8% slower)

def test_partial_overlap():
    # "llo" is a prefix of "lloworld"
    codeflash_output = prefix_hold("hello", ["lloworld", "test"]) # 3.42μs -> 4.15μs (17.7% slower)

def test_full_overlap():
    # The entire text is a prefix of a token
    codeflash_output = prefix_hold("hel", ["hello", "help"]) # 2.75μs -> 3.35μs (17.8% slower)

def test_multiple_tokens_overlap():
    # "lo" is a prefix of "loud", "llo" is a prefix of "lloworld"
    # The function should hold the longest: "llo"
    codeflash_output = prefix_hold("hello", ["loud", "lloworld", "test"]) # 3.80μs -> 4.22μs (9.86% slower)

def test_overlap_at_start_of_token():
    # "abc" is a prefix of "abcdef"
    codeflash_output = prefix_hold("abc", ["abcdef", "xyz"]) # 2.85μs -> 3.19μs (10.6% slower)

def test_single_character_overlap():
    # Only the last character overlaps with the start of a token
    codeflash_output = prefix_hold("abc", ["cdef", "xyz"]) # 3.04μs -> 3.91μs (22.2% slower)

def test_token_is_empty_string():
    # Tokens contain an empty string, which should be ignored
    codeflash_output = prefix_hold("abc", ["", "def"]) # 1.79μs -> 3.24μs (44.7% slower)

# -------- EDGE TEST CASES --------

def test_text_shorter_than_token():
    # Text is shorter than tokens, but is a prefix of a token
    codeflash_output = prefix_hold("ab", ["abc", "abd"]) # 2.71μs -> 3.22μs (15.8% slower)

def test_text_longer_than_token():
    # Text is longer than all tokens, but suffix matches prefix of a token
    codeflash_output = prefix_hold("xyzabc", ["ab", "abc"]) # 2.50μs -> 3.37μs (25.8% slower)

def test_token_is_single_character():
    # Suffix is a single character matching a token's prefix
    codeflash_output = prefix_hold("abc", ["a", "b", "c"]) # 1.88μs -> 2.56μs (26.4% slower)

def test_all_tokens_are_empty():
    # All tokens are empty strings
    codeflash_output = prefix_hold("abc", ["", "", ""]) # 643ns -> 1.76μs (63.5% slower)

def test_all_tokens_are_substrings():
    # All tokens are substrings of text, but not prefixes
    codeflash_output = prefix_hold("abcdef", ["cd", "ef", "de"]) # 2.88μs -> 3.58μs (19.5% slower)

def test_text_and_token_identical():
    # Text is exactly the same as a token
    codeflash_output = prefix_hold("token", ["token", "other"]) # 2.92μs -> 4.10μs (28.8% slower)

def test_suffix_equal_to_token_prefix():
    # Suffix of text matches prefix of a token, but not the entire text
    codeflash_output = prefix_hold("xyztoken", ["token", "other"]) # 2.82μs -> 3.86μs (26.9% slower)

def test_multiple_possible_matches():
    # Suffix matches multiple token prefixes, should pick the longest
    codeflash_output = prefix_hold("abcde", ["de", "cde", "e"]) # 2.83μs -> 3.58μs (20.9% slower)

def test_token_is_longer_than_text():
    # Token is much longer than text, but text is a prefix of token
    codeflash_output = prefix_hold("abc", ["abcdefg"]) # 2.17μs -> 3.16μs (31.4% slower)

def test_text_and_tokens_with_special_characters():
    # Special characters in text and tokens
    codeflash_output = prefix_hold("foo_bar", ["bar", "baz", "_bar"]) # 2.93μs -> 3.88μs (24.4% slower)

def test_unicode_characters():
    # Unicode characters in text and tokens
    codeflash_output = prefix_hold("café", ["fé", "café", "é"]) # 3.22μs -> 4.14μs (22.1% slower)

def test_suffix_overlap_with_multiple_tokens():
    # Suffix matches prefix of multiple tokens, longest should be chosen
    codeflash_output = prefix_hold("testing", ["ing", "testing", "g"]) # 3.19μs -> 4.64μs (31.3% slower)

def test_text_with_repeated_characters():
    # Repeated characters at the end that match token prefixes
    codeflash_output = prefix_hold("helloooo", ["oo", "ooo", "o"]) # 3.22μs -> 3.35μs (3.80% slower)

# -------- LARGE SCALE TEST CASES --------

def test_large_text_and_tokens_no_overlap():
    # Large text and tokens, no overlap
    text = "a" * 500 + "b" * 500
    tokens = ["c" * 100 for _ in range(10)]
    codeflash_output = prefix_hold(text, tokens) # 72.2μs -> 64.1μs (12.6% faster)

def test_large_text_and_tokens_with_overlap():
    # Large text with a suffix that matches prefix of a token
    text = "x" * 995 + "hello"
    tokens = ["hello", "world", "x" * 1000]
    codeflash_output = prefix_hold(text, tokens) # 86.9μs -> 191μs (54.7% slower)

def test_large_number_of_tokens():
    # Many tokens, some of which could match the suffix
    text = "abc" * 300 + "xyz"
    tokens = ["xyz"] * 999
    codeflash_output = prefix_hold(text, tokens) # 311μs -> 148μs (110% faster)

def test_longest_possible_hold_among_many_tokens():
    # Suffix matches the longest prefix among many tokens
    text = "a" * 990 + "longest"
    tokens = ["long", "longer", "longest", "lo", "l"]
    codeflash_output = prefix_hold(text, tokens) # 4.32μs -> 5.39μs (19.7% slower)

def test_performance_with_large_inputs():
    # Stress test: large inputs, ensure function completes and is correct
    text = "abc" * 333 + "def"
    tokens = ["def", "xyz", "abc" * 333 + "d"]
    codeflash_output = prefix_hold(text, tokens) # 88.9μs -> 191μs (53.6% slower)

def test_large_text_with_multiple_partial_overlaps():
    # Large text, multiple possible overlaps, longest should be chosen
    text = "a" * 995 + "overlap"
    tokens = ["lap", "ap", "overlap", "erlap"]
    codeflash_output = prefix_hold(text, tokens) # 3.87μs -> 5.05μs (23.5% slower)

# -------- ADDITIONAL EDGE CASES --------

def test_text_is_one_character():
    # Text is a single character, tokens may or may not match
    codeflash_output = prefix_hold("a", ["a", "b"]) # 1.69μs -> 2.49μs (32.4% slower)
    codeflash_output = prefix_hold("a", ["b", "c"]) # 697ns -> 1.08μs (35.3% slower)

def test_token_is_one_character():
    # Token is a single character, text is longer
    codeflash_output = prefix_hold("abc", ["a"]) # 1.21μs -> 2.25μs (45.9% slower)
    codeflash_output = prefix_hold("abc", ["c"]) # 513ns -> 965ns (46.8% slower)

def test_token_is_substring_of_text_but_not_prefix():
    # Token is substring of text, but not a prefix
    codeflash_output = prefix_hold("abcdef", ["cde"]) # 1.94μs -> 3.30μs (41.1% slower)

def test_text_and_tokens_are_spaces():
    # Text and tokens are whitespace
    codeflash_output = prefix_hold("   ", [" ", "  "]) # 2.75μs -> 3.37μs (18.4% slower)
    codeflash_output = prefix_hold("   ", ["a", "b"]) # 860ns -> 1.42μs (39.4% slower)

def test_text_and_tokens_are_identical_long():
    # Text and token are identical and long
    text = "x" * 999
    tokens = ["x" * 999]
    codeflash_output = prefix_hold(text, tokens) # 3.05μs -> 4.19μs (27.0% slower)

def test_tokens_with_empty_and_nonempty():
    # Tokens list has empty and non-empty tokens
    codeflash_output = prefix_hold("abc", ["", "ab", "bc"]) # 2.38μs -> 3.27μs (27.2% slower)

def test_tokens_with_duplicates():
    # Tokens list has duplicates
    codeflash_output = prefix_hold("abcabc", ["abc", "abc", "bc"]) # 2.80μs -> 3.69μs (24.1% slower)

def test_multiple_holds_same_length():
    # Multiple tokens match the same length, should still hold that length
    codeflash_output = prefix_hold("foobar", ["bar", "baz", "bap"]) # 2.76μs -> 3.55μs (22.2% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import List, Tuple

# imports
import pytest  # used for our unit tests
from sglang.srt.parser.harmony_parser import prefix_hold

# unit tests

# -------------------- BASIC TEST CASES --------------------

def test_basic_no_hold():
    # No suffix of text matches any token's prefix
    codeflash_output = prefix_hold("hello", ["world", "test"]) # 2.88μs -> 4.09μs (29.6% slower)

def test_basic_simple_hold():
    # Suffix "he" is prefix of "hello"
    codeflash_output = prefix_hold("the", ["hello", "world"]) # 3.18μs -> 3.81μs (16.4% slower)

def test_basic_full_match():
    # Suffix "hello" is full prefix of "hello"
    codeflash_output = prefix_hold("hello", ["hello"]) # 2.03μs -> 3.50μs (41.9% slower)

def test_basic_multiple_tokens():
    # Suffix "wor" is prefix of "world", but "hello" matches nothing
    codeflash_output = prefix_hold("wor", ["world", "hello"]) # 2.91μs -> 3.24μs (10.1% slower)

def test_basic_partial_overlap():
    # Suffix "lo" matches prefix of "long"
    codeflash_output = prefix_hold("hello", ["long", "test"]) # 2.88μs -> 3.62μs (20.5% slower)

def test_basic_empty_tokens():
    # No tokens to match, should emit all
    codeflash_output = prefix_hold("abc", []) # 508ns -> 1.84μs (72.4% slower)

def test_basic_empty_token_in_tokens():
    # Ignore empty tokens, should emit all
    codeflash_output = prefix_hold("abc", ["", ""]) # 615ns -> 1.79μs (65.6% slower)

def test_basic_token_shorter_than_text():
    # Suffix "ab" matches prefix of "ab"
    codeflash_output = prefix_hold("cab", ["ab", "ba"]) # 3.19μs -> 3.74μs (14.7% slower)

def test_basic_token_same_as_text():
    # Suffix "abc" matches prefix of "abc"
    codeflash_output = prefix_hold("abc", ["abc"]) # 1.96μs -> 3.29μs (40.6% slower)

def test_basic_token_is_substring():
    # Suffix "bc" matches prefix of "bcd"
    codeflash_output = prefix_hold("abc", ["bcd"]) # 2.25μs -> 3.26μs (31.1% slower)

# -------------------- EDGE TEST CASES --------------------

def test_edge_empty_text():
    # Empty text, should emit nothing and hold nothing
    codeflash_output = prefix_hold("", ["abc", "def"]) # 326ns -> 299ns (9.03% faster)

def test_edge_token_empty_and_nonempty():
    # One token is empty, should ignore it
    codeflash_output = prefix_hold("abc", ["", "ab"]) # 2.16μs -> 3.21μs (32.6% slower)

def test_edge_text_shorter_than_tokens():
    # Text shorter than all tokens, should still match prefixes
    codeflash_output = prefix_hold("a", ["apple", "anchor"]) # 3.01μs -> 3.49μs (13.7% slower)

def test_edge_text_equals_token():
    # Text equals token, should hold all
    codeflash_output = prefix_hold("token", ["token"]) # 2.08μs -> 3.70μs (43.7% slower)

def test_edge_token_is_empty_string():
    # Only token is empty string, should emit all
    codeflash_output = prefix_hold("abc", [""]) # 606ns -> 1.82μs (66.7% slower)

def test_edge_token_is_longer_than_text():
    # Token is longer, but prefix matches
    codeflash_output = prefix_hold("abc", ["abcdef"]) # 2.42μs -> 3.34μs (27.4% slower)

def test_edge_multiple_possible_holds():
    # Suffix "lo" matches "long", "o" matches "orange", should hold the longest ("lo")
    codeflash_output = prefix_hold("hello", ["long", "orange"]) # 3.68μs -> 4.18μs (11.8% slower)

def test_edge_multiple_tokens_same_prefix():
    # Suffix "ab" matches both "abc" and "abx", should hold "ab"
    codeflash_output = prefix_hold("cab", ["abc", "abx"]) # 2.73μs -> 3.20μs (14.5% slower)

def test_edge_all_tokens_empty():
    # All tokens empty, emit all
    codeflash_output = prefix_hold("abc", ["", "", ""]) # 664ns -> 1.85μs (64.2% slower)

def test_edge_text_and_tokens_empty():
    # Both text and tokens empty
    codeflash_output = prefix_hold("", [""]) # 323ns -> 315ns (2.54% faster)

def test_edge_text_single_char_tokens_single_char():
    # Suffix "a" matches prefix of "a"
    codeflash_output = prefix_hold("a", ["a", "b"]) # 2.05μs -> 2.51μs (18.3% slower)

def test_edge_token_is_single_char():
    # Suffix "c" matches prefix of "c"
    codeflash_output = prefix_hold("abc", ["c", "d"]) # 1.67μs -> 2.37μs (29.5% slower)

def test_edge_token_is_prefix_of_text_but_not_suffix():
    # "ab" is prefix of "abc", but not suffix, so emit all
    codeflash_output = prefix_hold("abc", ["ab"]) # 2.08μs -> 3.31μs (37.3% slower)

def test_edge_token_is_substring_but_not_prefix():
    # "b" is in "abc", but not as prefix, so emit all
    codeflash_output = prefix_hold("abc", ["b"]) # 1.30μs -> 2.30μs (43.7% slower)

def test_edge_text_is_substring_of_token():
    # "ab" is prefix of "abc", should hold all
    codeflash_output = prefix_hold("ab", ["abc"]) # 2.37μs -> 3.22μs (26.5% slower)

def test_edge_token_is_suffix_of_text_but_not_prefix():
    # "bc" is suffix of "abc", but not prefix of any token, so emit all
    codeflash_output = prefix_hold("abc", ["cb"]) # 2.35μs -> 3.32μs (29.0% slower)

def test_edge_text_has_multiple_suffixes_matching():
    # Suffix "bc" matches "bcd", "c" matches "cd", should hold "bc"
    codeflash_output = prefix_hold("abc", ["bcd", "cd"]) # 3.12μs -> 3.33μs (6.54% slower)

def test_edge_text_has_no_match():
    # No match at all
    codeflash_output = prefix_hold("xyz", ["abc", "def"]) # 2.58μs -> 3.62μs (28.8% slower)

def test_edge_token_is_prefix_of_text_and_suffix():
    # "ab" is both prefix and suffix, but only suffix matters
    codeflash_output = prefix_hold("ab", ["ab"]) # 1.76μs -> 2.95μs (40.4% slower)

# -------------------- LARGE SCALE TEST CASES --------------------

def test_large_many_tokens():
    # Create 1000 tokens with common prefix "abc", text ends with "abc"
    tokens = ["abc" + str(i) for i in range(1000)]
    codeflash_output = prefix_hold("xyzabc", tokens) # 436μs -> 145μs (200% faster)

def test_large_long_text():
    # Long text ending with "hello"
    text = "a" * 995 + "hello"
    tokens = ["hello", "world"]
    codeflash_output = prefix_hold(text, tokens) # 2.84μs -> 4.01μs (29.2% slower)

def test_large_tokens_with_various_prefixes():
    # 500 tokens with "foo", 500 with "bar", text ends with "bar"
    tokens = ["foo" + str(i) for i in range(500)] + ["bar" + str(i) for i in range(500)]
    codeflash_output = prefix_hold("testbar", tokens) # 462μs -> 169μs (174% faster)

def test_large_tokens_no_match():
    # 1000 tokens, none matching
    tokens = ["tok" + str(i) for i in range(1000)]
    codeflash_output = prefix_hold("abcdef", tokens) # 500μs -> 299μs (67.3% faster)

def test_large_tokens_with_partial_match():
    # 1000 tokens, only last few match
    tokens = ["tok" + str(i) for i in range(995)] + ["def", "de", "d", "xyz"]
    codeflash_output = prefix_hold("abcdef", tokens) # 499μs -> 300μs (66.2% faster)

def test_large_text_and_tokens_all_empty():
    # Large number of empty tokens, empty text
    tokens = [""] * 1000
    codeflash_output = prefix_hold("", tokens) # 328ns -> 316ns (3.80% faster)

def test_large_text_and_tokens_some_empty():
    # Large number of empty tokens, non-empty text
    tokens = [""] * 1000
    codeflash_output = prefix_hold("abc", tokens) # 9.52μs -> 10.2μs (6.29% slower)

def test_large_text_and_tokens_mixed():
    # Mixed tokens, some empty, some matching
    tokens = [""] * 500 + ["abc", "ab", "a"] + [""] * 497
    codeflash_output = prefix_hold("abc", tokens) # 12.6μs -> 12.1μs (4.43% faster)

def test_large_text_partial_match():
    # Long text, partial match at end
    text = "x" * 995 + "ab"
    tokens = ["ab", "cd"]
    codeflash_output = prefix_hold(text, tokens) # 2.36μs -> 3.39μs (30.4% slower)

def test_large_text_no_match():
    # Long text, no match
    text = "x" * 1000
    tokens = ["abc", "def"]
    codeflash_output = prefix_hold(text, tokens) # 2.53μs -> 3.54μs (28.5% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
🔎 Concolic Coverage Tests and Runtime
from sglang.srt.parser.harmony_parser import prefix_hold

def test_prefix_hold():
    prefix_hold('\x00', ['\x00\x00\x00', '', '\x01\x00\x00'])

def test_prefix_hold_2():
    prefix_hold('\x00', [])

def test_prefix_hold_3():
    prefix_hold('', [])

To edit these changes, run `git checkout codeflash/optimize-prefix_hold-mhonm601` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 7, 2025 09:29
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash and 🎯 Quality: Medium labels Nov 7, 2025