⚡️ Speed up function word_frequency by 22% #64

@codeflash-ai codeflash-ai bot commented Jul 30, 2025

📄 22% (0.22x) speedup for word_frequency in src/dsa/various.py

⏱️ Runtime: 724 microseconds → 595 microseconds (best of 930 runs)

📝 Explanation and details

The optimized code achieves a 21% speedup by replacing the manual dictionary construction loop with Python's built-in Counter class from the collections module.

Key optimization applied:

  • Eliminated the manual loop: The original code iterates through each word, checks if it exists in the dictionary (if word in frequency), and either increments or initializes the count. This involves multiple dictionary lookups and assignments.
  • Used Counter's optimized C implementation: Counter is implemented in C and optimized specifically for counting operations, avoiding the overhead of Python's interpreted loop execution.
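
As a sketch of the change being described (the actual contents of `src/dsa/various.py` are not shown in this PR, so both versions below are reconstructions assuming the lowercase-then-`\w+` tokenization that the tests imply):

```python
import re
from collections import Counter

def word_frequency_original(text: str) -> dict:
    """Manual counting loop, as described above."""
    words = re.findall(r"\w+", text.lower())
    frequency = {}
    for word in words:
        if word in frequency:   # membership check on every word
            frequency[word] += 1
        else:
            frequency[word] = 1
    return frequency

def word_frequency_optimized(text: str) -> dict:
    """Counter-based version; the counting loop runs in C."""
    words = re.findall(r"\w+", text.lower())
    return dict(Counter(words))
```

Both return the same mapping, e.g. `{"hello": 2, "world": 1}` for `"Hello hello world"`; only the counting mechanism differs.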

Why this leads to speedup:
The original code performs O(n) dictionary lookups where each lookup has potential hash collision overhead. The line profiler shows that 64.4% of the total time (33.1% + 31.3%) is spent on the loop iteration and dictionary membership checks. Counter eliminates this by using optimized internal counting mechanisms that batch these operations more efficiently.

Performance characteristics by test case type:

  • Small inputs (< 10 words): Optimized version is actually 50-76% slower due to Counter's initialization overhead outweighing the simple loop benefits
  • Large inputs (500+ words): Optimized version shows 12-70% speedup, with the greatest gains on highly repetitive data (like test_large_repeated_words at 69.9% faster)
  • Medium repetitive datasets: Best performance gains occur when the same words appear multiple times, as Counter's internal optimizations for duplicate counting become more beneficial than the original's repeated dictionary lookups

The optimization trades initialization overhead for loop efficiency, making it most effective on larger datasets with word repetition.
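
The crossover described above can be checked with a rough `timeit` micro-benchmark (a sketch; exact numbers depend on the machine and Python build):

```python
import timeit
from collections import Counter

def manual_count(words):
    """Baseline: explicit dict loop with a membership check per word."""
    freq = {}
    for w in words:
        if w in freq:
            freq[w] += 1
        else:
            freq[w] = 1
    return freq

words_small = "hello world".split()                    # 2 words
words_large = ("hello world foo bar " * 250).split()   # 1000 words

for name, ws in [("small", words_small), ("large", words_large)]:
    t_manual = timeit.timeit(lambda: manual_count(ws), number=10_000)
    t_counter = timeit.timeit(lambda: Counter(ws), number=10_000)
    print(f"{name}: manual={t_manual:.4f}s  Counter={t_counter:.4f}s")
```

On typical builds the manual loop wins on the tiny input while `Counter` pulls ahead on the 1000-word list, matching the pattern reported in the generated tests.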

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 49 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import random  # used for large scale random generation
import string  # used for punctuation edge cases

# imports
import pytest  # used for our unit tests
from src.dsa.various import word_frequency

# unit tests

# 1. Basic Test Cases

def test_single_word():
    # Single word input
    codeflash_output = word_frequency("hello") # 333ns -> 1.08μs (69.3% slower)

def test_multiple_words():
    # Multiple words, no repeats
    codeflash_output = word_frequency("the quick brown fox") # 500ns -> 1.25μs (60.0% slower)

def test_repeated_words():
    # Words repeated in the string
    codeflash_output = word_frequency("test test test") # 542ns -> 1.29μs (58.0% slower)

def test_case_insensitivity():
    # Should count 'Hello' and 'hello' as the same word
    codeflash_output = word_frequency("Hello hello HELLO") # 541ns -> 1.25μs (56.7% slower)

def test_mixed_case_and_words():
    # Mixed case and different words
    codeflash_output = word_frequency("The cat and the dog") # 625ns -> 1.33μs (53.1% slower)

def test_leading_and_trailing_spaces():
    # Leading and trailing whitespace should not affect result
    codeflash_output = word_frequency("   hello world   ") # 375ns -> 1.17μs (67.9% slower)

def test_multiple_spaces_between_words():
    # Multiple spaces between words should be ignored
    codeflash_output = word_frequency("hello    world") # 375ns -> 1.17μs (67.9% slower)

# 2. Edge Test Cases

def test_empty_string():
    # Empty input should return empty dictionary
    codeflash_output = word_frequency("") # 250ns -> 1.04μs (76.0% slower)

def test_only_spaces():
    # Input with only whitespace should return empty dictionary
    codeflash_output = word_frequency("     ") # 250ns -> 1.08μs (76.9% slower)

def test_only_punctuation():
    # Input with only punctuation should return empty dictionary
    codeflash_output = word_frequency("!!!...,,,") # 333ns -> 1.12μs (70.4% slower)

def test_punctuation_between_words():
    # Punctuation between words should be ignored
    codeflash_output = word_frequency("hello, world! hello?") # 458ns -> 1.25μs (63.4% slower)

def test_punctuation_attached_to_words():
    # Words with attached punctuation should be counted as the word only
    codeflash_output = word_frequency("end. end, end; end:end?") # 500ns -> 1.29μs (61.3% slower)

def test_newlines_and_tabs():
    # Newlines and tabs as whitespace
    codeflash_output = word_frequency("foo\nbar\tbaz foo") # 583ns -> 1.29μs (54.8% slower)

def test_numbers_and_words():
    # Numbers should be counted as words
    codeflash_output = word_frequency("one 1 two 2 2") # 625ns -> 1.25μs (50.0% slower)

def test_underscore_and_alphanumeric():
    # Words with underscores should be counted as words
    codeflash_output = word_frequency("foo_bar foo_bar foo") # 500ns -> 1.29μs (61.3% slower)

def test_non_ascii_letters():
    # Non-ASCII letters should be counted as words
    codeflash_output = word_frequency("naïve café résumé") # 625ns -> 1.38μs (54.5% slower)

def test_hyphenated_words():
    # Hyphenated words are split into separate words (since \w doesn't match '-')
    codeflash_output = word_frequency("mother-in-law") # 333ns -> 1.17μs (71.5% slower)

def test_apostrophes_in_words():
    # Apostrophes are not part of \w, so "it's" becomes "it" and "s"
    codeflash_output = word_frequency("it's John's book") # 417ns -> 1.25μs (66.6% slower)

def test_long_word():
    # Very long word should be handled
    long_word = "a" * 100
    codeflash_output = word_frequency(long_word) # 416ns -> 1.25μs (66.7% slower)

def test_unicode_emoji():
    # Emojis are not counted as words
    codeflash_output = word_frequency("hello 😊 world 😊") # 708ns -> 1.46μs (51.4% slower)

def test_mixed_language():
    # Mixed language input
    codeflash_output = word_frequency("hello 你好 hello") # 625ns -> 1.38μs (54.5% slower)

def test_word_with_digits():
    # Words containing digits
    codeflash_output = word_frequency("abc123 123abc 123") # 417ns -> 1.17μs (64.3% slower)

# 3. Large Scale Test Cases

def test_large_unique_words():
    # 1000 unique words
    words = [f"word{i}" for i in range(1000)]
    text = " ".join(words)
    expected = {w: 1 for w in words}
    codeflash_output = word_frequency(text) # 54.6μs -> 48.7μs (12.2% faster)
    assert codeflash_output == expected

def test_large_repeated_words():
    # 10 words, each repeated 100 times (total 1000 words)
    base_words = [f"word{i}" for i in range(10)]
    words = base_words * 100
    random.shuffle(words)
    text = " ".join(words)
    expected = {w: 100 for w in base_words}
    codeflash_output = word_frequency(text) # 67.0μs -> 39.5μs (69.9% faster)
    assert codeflash_output == expected

def test_large_with_punctuation_and_case():
    # 500 words, each with random punctuation and case
    base_words = [f"test{i}" for i in range(500)]
    text = ""
    for w in base_words:
        # Randomly capitalize and add punctuation
        word = w.upper() if random.choice([True, False]) else w.lower()
        punct = random.choice(["", ".", ",", "!", "?"])
        text += word + punct + " "
    expected = {w: 1 for w in base_words}
    codeflash_output = word_frequency(text) # 26.5μs -> 23.9μs (11.0% faster)
    assert codeflash_output == expected

def test_large_with_numbers_and_underscores():
    # 1000 words with numbers and underscores
    words = [f"word_{i}_123" for i in range(1000)]
    text = " ".join(words)
    expected = {w: 1 for w in words}
    codeflash_output = word_frequency(text) # 59.6μs -> 52.1μs (14.4% faster)
    assert codeflash_output == expected

def test_large_mixed_content():
    # 500 unique words, each repeated twice, mixed with punctuation, numbers, and whitespace
    base_words = [f"w{i}" for i in range(500)]
    words = []
    for w in base_words:
        words.append(w)
        words.append(w)
    random.shuffle(words)
    text = "  ".join(words) + "\n" + " , ".join(words) + "\t" + " ".join(str(i) for i in range(500))
    expected = {w: 2 for w in base_words}
    for i in range(500):
        expected[str(i)] = 1
    codeflash_output = word_frequency(text) # 207μs -> 160μs (29.7% faster)
    assert codeflash_output == expected
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import random  # used for large scale random text generation
import string  # used for punctuation edge cases

# imports
import pytest  # used for our unit tests
from src.dsa.various import word_frequency

# unit tests

# -------------------------------
# Basic Test Cases
# -------------------------------

def test_empty_string_returns_empty_dict():
    # Test that an empty string returns an empty dictionary
    codeflash_output = word_frequency("") # 250ns -> 1.08μs (76.9% slower)

def test_single_word():
    # Test a string with a single word
    codeflash_output = word_frequency("hello") # 333ns -> 1.17μs (71.4% slower)

def test_multiple_words():
    # Test a string with multiple distinct words
    codeflash_output = word_frequency("hello world") # 416ns -> 1.21μs (65.6% slower)

def test_repeated_words():
    # Test a string with repeated words
    codeflash_output = word_frequency("hello hello world") # 541ns -> 1.25μs (56.7% slower)

def test_case_insensitivity():
    # Test that the function is case-insensitive
    codeflash_output = word_frequency("Hello hELLo HELLO") # 542ns -> 1.25μs (56.6% slower)

def test_mixed_case_and_repeats():
    # Test mixed case and repeated words
    codeflash_output = word_frequency("The quick brown fox jumps over the lazy dog the THE") # 1.00μs -> 1.62μs (38.5% slower)

# -------------------------------
# Edge Test Cases
# -------------------------------

def test_punctuation_ignored():
    # Test that punctuation is ignored and words are counted correctly
    codeflash_output = word_frequency("hello, world! hello.") # 458ns -> 1.25μs (63.4% slower)

def test_only_punctuation():
    # Test that a string of only punctuation returns an empty dict
    codeflash_output = word_frequency("!@#$%^&*()") # 333ns -> 1.12μs (70.4% slower)

def test_numbers_as_words():
    # Test that numbers are treated as words
    codeflash_output = word_frequency("123 456 123") # 541ns -> 1.25μs (56.7% slower)

def test_alphanumeric_words():
    # Test words with numbers and letters mixed
    codeflash_output = word_frequency("abc123 123abc abc123") # 541ns -> 1.29μs (58.1% slower)

def test_words_with_apostrophes_and_hyphens():
    # Test that apostrophes and hyphens are split (since \w doesn't include them)
    # "don't" -> "don", "t"; "mother-in-law" -> "mother", "in", "law"
    codeflash_output = word_frequency("don't mother-in-law") # 375ns -> 1.21μs (69.0% slower)

def test_leading_and_trailing_spaces():
    # Test that leading/trailing/multiple spaces are ignored
    codeflash_output = word_frequency("   hello   world   hello   ") # 541ns -> 1.25μs (56.7% slower)

def test_tab_and_newline_separators():
    # Test that tabs and newlines are treated as word separators
    codeflash_output = word_frequency("hello\tworld\nhello") # 541ns -> 1.25μs (56.7% slower)

def test_unicode_letters():
    # Test that unicode words are counted (e.g., accented characters)
    # \w includes unicode letters in Python 3
    codeflash_output = word_frequency("café Café CAFÉ") # 667ns -> 1.38μs (51.5% slower)

def test_mixed_nonword_characters():
    # Test that non-word characters split words
    codeflash_output = word_frequency("word1@word2#word3") # 334ns -> 1.17μs (71.4% slower)

def test_long_word():
    # Test a very long word
    long_word = "a" * 1000
    codeflash_output = word_frequency(long_word) # 1.33μs -> 2.17μs (38.5% slower)

def test_word_with_underscore():
    # Test that underscores are included as part of words
    codeflash_output = word_frequency("foo_bar foo_bar foo") # 541ns -> 1.29μs (58.1% slower)

# -------------------------------
# Large Scale Test Cases
# -------------------------------

def test_large_text_all_same_word():
    # Test a large input with the same word repeated 1000 times
    text = "hello " * 1000
    codeflash_output = word_frequency(text) # 54.8μs -> 41.5μs (31.9% faster)

def test_large_text_many_unique_words():
    # Test a large input with 1000 unique words
    words = [f"word{i}" for i in range(1000)]
    text = " ".join(words)
    expected = {w: 1 for w in words}
    codeflash_output = word_frequency(text) # 51.2μs -> 45.3μs (12.9% faster)
    assert codeflash_output == expected

def test_large_text_mixed_repeats():
    # Test a large input with 500 unique words, each repeated twice
    words = [f"word{i}" for i in range(500)]
    text = " ".join(words * 2)
    expected = {w: 2 for w in words}
    codeflash_output = word_frequency(text) # 58.7μs -> 48.2μs (21.8% faster)
    assert codeflash_output == expected

def test_large_random_text():
    # Test a large input with 1000 words randomly chosen from a set of 10
    base_words = [f"w{i}" for i in range(10)]
    random.seed(42)
    words = [random.choice(base_words) for _ in range(1000)]
    text = " ".join(words)
    # Calculate expected frequencies
    expected = {}
    for w in words:
        expected[w] = expected.get(w, 0) + 1
    codeflash_output = word_frequency(text) # 61.7μs -> 38.8μs (58.8% faster)
    assert codeflash_output == expected

def test_large_text_with_punctuation_and_case():
    # Test a large input with punctuation and varying case
    base_words = ["Alpha", "beta", "GAMMA", "delta"]
    punctuations = [".", ",", "!", "?", ";", ":"]
    random.seed(0)
    text = " ".join(
        f"{random.choice(base_words)}{random.choice(punctuations)}"
        for _ in range(1000)
    )
    # All words should be lowercased and punctuation ignored.
    # Count the words as they actually appear in the generated text,
    # stripping punctuation and lowercasing each token.
    expected = {}
    for token in text.split():
        word = token.strip("".join(punctuations)).lower()
        expected[word] = expected.get(word, 0) + 1
    codeflash_output = word_frequency(text) # 62.5μs -> 45.9μs (36.2% faster)
    assert codeflash_output == expected
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from src.dsa.various import word_frequency

def test_word_frequency():
    word_frequency('Ĩ')

To edit these changes, run `git checkout codeflash/optimize-word_frequency-mdpcmb3p` and push.

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Jul 30, 2025
@codeflash-ai codeflash-ai bot requested a review from aseembits93 July 30, 2025 02:30