
Conversation

@KRRT7 (Contributor) commented Nov 10, 2025

PR Type

Enhancement, Bug fix


Description

  • Auto-remove failing generated regression tests

  • Add async safety by pruning bad tests early

  • Insert console rule after cleanup step

  • Minor typing: import Pattern for regex use


Diagram Walkthrough

flowchart LR
  setup["Test setup completed"] -- "run behavior tests" --> run["Run and parse tests"]
  run -- "collect failing generated tests" --> collect["Failing test names"]
  collect -- "compile regex patterns" --> compile["Compile function patterns"]
  compile -- "remove from sources" --> apply["Delete tests in files"]
  apply -- "log and continue" --> proceed["Establish baseline & optimize"]

File Walkthrough

Relevant files

Enhancement
function_optimizer.py (codeflash/optimization/function_optimizer.py): Pre-baseline pruning of failing generated tests (+56/-0)

  • Add remove_failing_tests to prune bad generated tests
  • Add remove_tests_from_source with regex-based deletion
  • Invoke removal before establishing baseline; add console rule
  • Add TYPE_CHECKING import for Pattern


github-actions bot commented Nov 10, 2025

PR Reviewer Guide 🔍

(Review updated until commit cef9891)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

When no failing tests are found, the function patterns are still compiled and the removal loops still run; the info log reports that all generated tests pass, yet the code silently proceeds to attempt removals. Consider returning early to avoid unnecessary file reads/writes and to keep the log consistent with the behavior.

if not failing_test_names:
    logger.info("All generated tests pass ✅")

function_patterns = _compile_function_patterns(failing_test_names)

for test_file in self.test_files.test_files:
    if test_file.test_type != TestType.GENERATED_REGRESSION:
        continue

    for file_path in [test_file.instrumented_behavior_file_path, test_file.benchmarking_file_path]:
        if file_path and file_path.exists():
            source = file_path.read_text(encoding="utf-8")
            source = self.remove_tests_from_source(source, function_patterns)
            file_path.write_text(source, encoding="utf-8")

    if test_file.original_source:
        test_file.original_source = self.remove_tests_from_source(test_file.original_source, function_patterns)

if failing_test_names:
Robustness

The test-name-to-pattern compilation and subsequent regex-based deletion may match inside parametrized tests or non-test code segments; while there is a guard for '@pytest.mark.parametrize', other decorators or multiline definitions may still be altered inadvertently. Consider anchoring patterns more strictly or parsing with AST.

def remove_tests_from_source(self, source: str, function_patterns: list[Pattern]) -> str:
    for pattern in function_patterns:
        match = pattern.search(source)
        while match:
            if "@pytest.mark.parametrize" in match.group(0):
                match = pattern.search(source, match.end())
                continue
            start, end = match.span()
            source = source[:start] + source[end:]
            match = pattern.search(source, start)
    return source
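
A structural alternative to the regex deletion, if the AST route mentioned above is preferred; this is a minimal sketch using only the standard library (Python 3.9+ for ast.unparse), and remove_tests_ast and its guard are hypothetical names, not part of the PR:

import ast


def remove_tests_ast(source: str, failing_test_names: set[str]) -> str:
    """Drop top-level failing test functions structurally instead of via regex."""
    tree = ast.parse(source)

    def is_parametrized(node) -> bool:
        # Mirror the regex version's guard: leave parametrized tests untouched.
        return any("parametrize" in ast.unparse(dec) for dec in node.decorator_list)

    tree.body = [
        node
        for node in tree.body
        if not (
            isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
            and node.name in failing_test_names
            and not is_parametrized(node)
        )
    ]
    # Note: ast.unparse drops comments and original formatting.
    return ast.unparse(tree)

One trade-off worth noting: unparsing loses comments and formatting in the rewritten test file, so this fits best when the generated tests are treated as disposable artifacts.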
Missing Import Use

The TYPE_CHECKING import of Pattern is only visible to static type checkers; make sure the 'list[Pattern]' annotation is never evaluated at runtime (for example via postponed evaluation of annotations), or add a runtime import fallback, otherwise type-checking or runtime tools that evaluate annotations can hit a NameError.

if TYPE_CHECKING:
    from argparse import Namespace
    from re import Pattern

    from codeflash.discovery.functions_to_optimize import FunctionToOptimize
    from codeflash.either import Result
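
One low-risk way to keep the Pattern import type-check-only without risking a NameError is postponed evaluation of annotations; a minimal sketch of the pattern (free function for illustration, not the PR's exact method):

from __future__ import annotations  # PEP 563: annotations are stored as strings, not evaluated

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from re import Pattern  # visible to static type checkers only


def remove_tests_from_source(source: str, function_patterns: list[Pattern]) -> str:
    # Under the future import, 'list[Pattern]' is never evaluated at runtime,
    # so the TYPE_CHECKING-only import cannot raise NameError here. If some
    # tool does evaluate these annotations (typing.get_type_hints and the like),
    # switch to a plain runtime import and annotate with re.Pattern instead.
    ...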


github-actions bot commented Nov 10, 2025

PR Code Suggestions ✨

Latest suggestions up to cef9891
Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Fix potential matching loop

Prevent infinite loops and partial deletions by advancing the index on the unchanged
source when skipping parametrized tests. Also guard against overlapping patterns by
always moving the cursor forward after non-mutating matches.

codeflash/optimization/function_optimizer.py [1604-1614]

 def remove_tests_from_source(self, source: str, function_patterns: list[Pattern]) -> str:
     for pattern in function_patterns:
-        match = pattern.search(source)
-        while match:
+        pos = 0
+        while True:
+            match = pattern.search(source, pos)
+            if not match:
+                break
             if "@pytest.mark.parametrize" in match.group(0):
-                match = pattern.search(source, match.end())
+                # Advance past this match to avoid re-matching at the same position
+                pos = match.end()
                 continue
             start, end = match.span()
             source = source[:start] + source[end:]
-            match = pattern.search(source, start)
+            # Continue searching from the deletion point
+            pos = start
     return source
Suggestion importance[1-10]: 7


Why: Adjusting the search position prevents re-matching on skipped parametrized cases and reduces risk of looping; this is a targeted robustness improvement aligned with the existing logic.

Medium
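
To sanity-check the suggested cursor handling, here is a self-contained run of the patched function; the real patterns come from _compile_function_patterns, whose exact shape is not shown in this PR, so the regex below is only a stand-in:

import re


def remove_tests_from_source(source, function_patterns):
    # Patched version from the suggestion above.
    for pattern in function_patterns:
        pos = 0
        while True:
            match = pattern.search(source, pos)
            if not match:
                break
            if "@pytest.mark.parametrize" in match.group(0):
                pos = match.end()  # skip, but keep moving forward
                continue
            start, end = match.span()
            source = source[:start] + source[end:]
            pos = start  # resume at the deletion point
    return source


source = (
    "@pytest.mark.parametrize('x', [1, 2])\n"
    "def test_failing_param(x):\n"
    "    assert x\n"
    "\n"
    "def test_failing_plain():\n"
    "    assert False\n"
    "\n"
    "def test_passing():\n"
    "    assert True\n"
)

# Stand-in pattern: optional decorators plus the named test, up to the next
# top-level statement or end of file.
patterns = [
    re.compile(rf"(?ms)^(?:@.*\n)*def {name}\b.*?(?=^\S|\Z)")
    for name in ("test_failing_param", "test_failing_plain")
]

cleaned = remove_tests_from_source(source, patterns)
assert "test_failing_plain" not in cleaned  # failing plain test removed
assert "test_failing_param" in cleaned      # parametrized test left in place
assert "test_passing" in cleaned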
Add early return guard

Return early when there are no failing tests to avoid compiling empty patterns and
mutating files unnecessarily. This prevents accidental edits and redundant work when
failing_test_names is empty.

codeflash/optimization/function_optimizer.py [1564-1587]

 def remove_failing_tests(self) -> None:
     from codeflash.code_utils.edit_generated_tests import _compile_function_patterns
 
     test_env = self.get_test_env(codeflash_loop_index=0, codeflash_test_iteration=0, codeflash_tracer_disable=1)
     behavioral_results, _ = self.run_and_parse_tests(
         testing_type=TestingMode.BEHAVIOR,
         test_env=test_env,
         test_files=self.test_files,
         optimization_iteration=0,
         testing_time=TOTAL_LOOPING_TIME_EFFECTIVE,
         enable_coverage=False,
     )
 
     failing_test_names = [
         result.id.test_function_name
         for result in behavioral_results
         if (result.test_type == TestType.GENERATED_REGRESSION and not result.did_pass)
     ]
 
     if not failing_test_names:
         logger.info("All generated tests pass ✅")
+        return
 
     function_patterns = _compile_function_patterns(failing_test_names)
Suggestion importance[1-10]: 6


Why: Early-returning when failing_test_names is empty avoids compiling and scanning files unnecessarily; it's a reasonable optimization with low risk and moderate impact.

Low
General
Avoid unnecessary file writes

Ensure you only write files when the content actually changes to avoid unnecessary
writes and potential file timestamp churn. This also reduces risk if
remove_tests_from_source returns the same content.

codeflash/optimization/function_optimizer.py [1588-1600]

 for test_file in self.test_files.test_files:
     if test_file.test_type != TestType.GENERATED_REGRESSION:
         continue
 
     for file_path in [test_file.instrumented_behavior_file_path, test_file.benchmarking_file_path]:
         if file_path and file_path.exists():
             source = file_path.read_text(encoding="utf-8")
-            source = self.remove_tests_from_source(source, function_patterns)
-            file_path.write_text(source, encoding="utf-8")
+            updated = self.remove_tests_from_source(source, function_patterns)
+            if updated != source:
+                file_path.write_text(updated, encoding="utf-8")
 
     if test_file.original_source:
-        test_file.original_source = self.remove_tests_from_source(test_file.original_source, function_patterns)
+        updated_original = self.remove_tests_from_source(test_file.original_source, function_patterns)
+        if updated_original != test_file.original_source:
+            test_file.original_source = updated_original
Suggestion importance[1-10]: 5


Why: Writing only on change reduces I/O and timestamp churn; it's accurate and improves efficiency but is not critical to correctness.

Low
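
If the write-only-on-change check ends up repeated, a tiny helper keeps it in one place; a sketch (write_if_changed is a hypothetical name, not an existing codeflash utility):

from pathlib import Path


def write_if_changed(file_path: Path, new_source: str) -> bool:
    # Write only when the content actually differs; returns True if a write happened.
    current = file_path.read_text(encoding="utf-8") if file_path.exists() else None
    if current == new_source:
        return False
    file_path.write_text(new_source, encoding="utf-8")
    return True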

Previous suggestions

Suggestions up to commit 5933149
Category | Suggestion | Impact
General
Fix mismatch in displayed metrics

perf_improvement_line uses speedup_pct and speedup_x strings, which are based on
runtime even when speedup() returns throughput-based values. Align the displayed
strings with the metric chosen by speedup() to avoid misleading output.

codeflash/result/explanation.py [32-55]

 @property
 def perf_improvement_line(self) -> str:
-    # speedup property already handles choosing between runtime and throughput
-    return f"{self.speedup_pct} improvement ({self.speedup_x} faster)."
-
-@property
-def speedup(self) -> float:
-    runtime_improvement = (self.original_runtime_ns / self.best_runtime_ns) - 1
-
-    # Use throughput improvement if we have async metrics and throughput is better
+    # Align displayed strings with chosen metric
     if (
         self.original_async_throughput is not None
         and self.best_async_throughput is not None
         and self.original_async_throughput > 0
     ):
+        runtime_improvement = (self.original_runtime_ns / self.best_runtime_ns) - 1
         throughput_improvement = throughput_gain(
-            original_throughput=self.original_async_throughput, optimized_throughput=self.best_async_throughput
+            original_throughput=self.original_async_throughput,
+            optimized_throughput=self.best_async_throughput,
         )
+        if throughput_improvement > runtime_improvement or runtime_improvement <= 0:
+            throughput_pct = f"{throughput_improvement * 100:,.0f}%"
+            throughput_x = f"{throughput_improvement + 1:,.2f}x"
+            return f"{throughput_pct} improvement ({throughput_x} faster)."
+    return f"{self.speedup_pct} improvement ({self.speedup_x} faster)."
 
-        # Use throughput metrics if throughput improvement is better or runtime got worse
-        if throughput_improvement > runtime_improvement or runtime_improvement <= 0:
-            return throughput_improvement
-
-    return runtime_improvement
-
Suggestion importance[1-10]: 8


Why: The existing code uses runtime-based strings even when throughput governs speedup; aligning perf_improvement_line with throughput when appropriate fixes misleading output and matches the PR’s throughput-focused changes.

Medium
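
As a worked example of the intended display, assuming throughput_gain returns optimized/original - 1 (which is what the '* 100' and '+ 1' formatting above implies):

# Hypothetical numbers, not measured data.
original_async_throughput = 100  # completed calls per benchmarking window
best_async_throughput = 250

throughput_improvement = best_async_throughput / original_async_throughput - 1  # 1.5
print(f"{throughput_improvement * 100:,.0f}% improvement ({throughput_improvement + 1:,.2f}x faster).")
# -> 150% improvement (2.50x faster).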
Possible issue
Refresh test registry after edits

After mutating test files in remove_failing_tests, update self.test_files to reflect
removed tests; otherwise subsequent runs may still attempt to execute now-deleted
tests, causing file-not-found or stale-source issues. Filter out removed functions
from in-memory sources/structures.

codeflash/optimization/function_optimizer.py [1621-1623]

 if failing_test_names:
     logger.info(f"Removed {len(failing_test_names)} failing generated test(s) ❌")
+    # Refresh in-memory test registry to avoid stale entries
+    self.test_files = self.test_files.__class__(
+        test_files=[
+            tf for tf in self.test_files.test_files
+            if tf.test_type != TestType.GENERATED_REGRESSION
+            or not any(pat.search(tf.original_source or "") for pat in function_patterns)
+        ]
+    )
Suggestion importance[1-10]: 6


Why: The existing lines match, and after mutating files it’s reasonable to refresh in-memory test_files to avoid stale entries; however, the proposed reconstruction of self.test_files assumes constructor details and may not fully align with the class design.

Low
Defer failing-tests removal

Running test deletion before establishing the baseline can skew baselines by
removing failing generated tests that might have surfaced issues in the original
code. Move this call to after the baseline is established to ensure metrics and
comparisons remain consistent and fair.

codeflash/optimization/function_optimizer.py [417-419]

-self.remove_failing_tests()
 console.rule()
 
+baseline_setup_result = self.setup_and_establish_baseline(
+    code_context=code_context,
+    original_helper_code=original_helper_code,
+    function_to_concolic_tests=function_to_concolic_tests,
+    ...
+)
+
+self.remove_failing_tests()
+
Suggestion importance[1-10]: 4


Why: The lines exist and moving remove_failing_tests() after establishing the baseline could prevent skewing baseline metrics, but the PR intent seems to clean flaky generated tests early and no concrete bug is shown; impact is moderate and context-dependent.

Low

@github-actions

Persistent review updated to latest commit cef9891

]

if not failing_test_names:
    logger.info("All generated tests pass ✅")
Contributor

this message is not necessary

from codeflash.code_utils.edit_generated_tests import _compile_function_patterns

test_env = self.get_test_env(codeflash_loop_index=0, codeflash_test_iteration=0, codeflash_tracer_disable=1)
behavioral_results, _ = self.run_and_parse_tests(
Contributor

we can't run tests again. we already have this info

Contributor Author

we don't have this info, I'm doing this before the behavioral and performance tests run

@KRRT7 KRRT7 closed this Nov 11, 2025
@KRRT7 KRRT7 deleted the async-improvs branch November 11, 2025 01:12
