
Conversation

@KRRT7 (Contributor) commented Nov 10, 2025

PR Type

Enhancement, Bug fix


Description

  • Auto-remove failing generated regression tests

  • Add async safety by pruning bad tests early

  • Insert console rule after cleanup step

  • Minor typing: import Pattern for regex use


Diagram Walkthrough

flowchart LR
  setup["Test setup completed"] -- "run behavior tests" --> run["Run and parse tests"]
  run -- "collect failing generated tests" --> collect["Failing test names"]
  collect -- "compile regex patterns" --> compile["Compile function patterns"]
  compile -- "remove from sources" --> apply["Delete tests in files"]
  apply -- "log and continue" --> proceed["Establish baseline & optimize"]

File Walkthrough

Relevant files

Enhancement
function_optimizer.py (codeflash/optimization/function_optimizer.py): Pre-baseline pruning of failing generated tests (+56/-0)

  • Add remove_failing_tests to prune bad generated tests
  • Add remove_tests_from_source with regex-based deletion
  • Invoke removal before establishing baseline; add console rule
  • Add TYPE_CHECKING import for Pattern


github-actions bot commented Nov 10, 2025

PR Reviewer Guide 🔍

(Review updated until commit cef9891)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

When no failing tests are found, the function patterns are still compiled and the removal loops still run; the info log reports that all generated tests pass, yet the code silently proceeds to attempt removals. Consider returning early to avoid unnecessary file reads/writes and to keep the log consistent with the behavior.

if not failing_test_names:
    logger.info("All generated tests pass ✅")

function_patterns = _compile_function_patterns(failing_test_names)

for test_file in self.test_files.test_files:
    if test_file.test_type != TestType.GENERATED_REGRESSION:
        continue

    for file_path in [test_file.instrumented_behavior_file_path, test_file.benchmarking_file_path]:
        if file_path and file_path.exists():
            source = file_path.read_text(encoding="utf-8")
            source = self.remove_tests_from_source(source, function_patterns)
            file_path.write_text(source, encoding="utf-8")

    if test_file.original_source:
        test_file.original_source = self.remove_tests_from_source(test_file.original_source, function_patterns)

if failing_test_names:
Robustness

The test-name-to-pattern compilation and subsequent regex-based deletion may match inside parametrized tests or non-test code segments; while there is a guard for '@pytest.mark.parametrize', other decorators or multiline definitions may still be altered inadvertently. Consider anchoring patterns more strictly or parsing with AST.

def remove_tests_from_source(self, source: str, function_patterns: list[Pattern]) -> str:
    for pattern in function_patterns:
        match = pattern.search(source)
        while match:
            if "@pytest.mark.parametrize" in match.group(0):
                match = pattern.search(source, match.end())
                continue
            start, end = match.span()
            source = source[:start] + source[end:]
            match = pattern.search(source, start)
    return source
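
A structural alternative to the regex deletion, if the AST route mentioned above is preferred; this is a minimal sketch using only the standard library (Python 3.9+ for ast.unparse), and remove_tests_ast and its guard are hypothetical names, not part of the PR:

import ast


def remove_tests_ast(source: str, failing_test_names: set[str]) -> str:
    """Drop top-level failing test functions structurally instead of via regex."""
    tree = ast.parse(source)

    def is_parametrized(node) -> bool:
        # Mirror the regex version's guard: leave parametrized tests untouched.
        return any("parametrize" in ast.unparse(dec) for dec in node.decorator_list)

    tree.body = [
        node
        for node in tree.body
        if not (
            isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
            and node.name in failing_test_names
            and not is_parametrized(node)
        )
    ]
    # Note: ast.unparse drops comments and original formatting.
    return ast.unparse(tree)

One trade-off worth noting: unparsing loses comments and formatting in the rewritten test file, so this fits best when the generated tests are treated as disposable artifacts.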
Missing Import Use

The TYPE_CHECKING import of Pattern is only visible to static type checkers; make sure the 'list[Pattern]' annotation is never evaluated at runtime (for example via postponed evaluation of annotations), or add a runtime import fallback, otherwise type-checking or runtime tools that evaluate annotations can hit a NameError.

if TYPE_CHECKING:
    from argparse import Namespace
    from re import Pattern

    from codeflash.discovery.functions_to_optimize import FunctionToOptimize
    from codeflash.either import Result
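
One low-risk way to keep the Pattern import type-check-only without risking a NameError is postponed evaluation of annotations; a minimal sketch of the pattern (free function for illustration, not the PR's exact method):

from __future__ import annotations  # PEP 563: annotations are stored as strings, not evaluated

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from re import Pattern  # visible to static type checkers only


def remove_tests_from_source(source: str, function_patterns: list[Pattern]) -> str:
    # Under the future import, 'list[Pattern]' is never evaluated at runtime,
    # so the TYPE_CHECKING-only import cannot raise NameError here. If some
    # tool does evaluate these annotations (typing.get_type_hints and the like),
    # switch to a plain runtime import and annotate with re.Pattern instead.
    ...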


github-actions bot commented Nov 10, 2025

PR Code Suggestions ✨

Latest suggestions up to cef9891
Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Fix potential matching loop

Prevent infinite loops and partial deletions by advancing the index on the unchanged
source when skipping parametrized tests. Also guard against overlapping patterns by
always moving the cursor forward after non-mutating matches.

codeflash/optimization/function_optimizer.py [1604-1614]

 def remove_tests_from_source(self, source: str, function_patterns: list[Pattern]) -> str:
     for pattern in function_patterns:
-        match = pattern.search(source)
-        while match:
+        pos = 0
+        while True:
+            match = pattern.search(source, pos)
+            if not match:
+                break
             if "@pytest.mark.parametrize" in match.group(0):
-                match = pattern.search(source, match.end())
+                # Advance past this match to avoid re-matching at the same position
+                pos = match.end()
                 continue
             start, end = match.span()
             source = source[:start] + source[end:]
-            match = pattern.search(source, start)
+            # Continue searching from the deletion point
+            pos = start
     return source
Suggestion importance[1-10]: 7


Why: Adjusting the search position prevents re-matching on skipped parametrized cases and reduces risk of looping; this is a targeted robustness improvement aligned with the existing logic.

Medium
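
To sanity-check the suggested cursor handling, here is a self-contained run of the patched function; the real patterns come from _compile_function_patterns, whose exact shape is not shown in this PR, so the regex below is only a stand-in:

import re


def remove_tests_from_source(source, function_patterns):
    # Patched version from the suggestion above.
    for pattern in function_patterns:
        pos = 0
        while True:
            match = pattern.search(source, pos)
            if not match:
                break
            if "@pytest.mark.parametrize" in match.group(0):
                pos = match.end()  # skip, but keep moving forward
                continue
            start, end = match.span()
            source = source[:start] + source[end:]
            pos = start  # resume at the deletion point
    return source


source = (
    "@pytest.mark.parametrize('x', [1, 2])\n"
    "def test_failing_param(x):\n"
    "    assert x\n"
    "\n"
    "def test_failing_plain():\n"
    "    assert False\n"
    "\n"
    "def test_passing():\n"
    "    assert True\n"
)

# Stand-in pattern: optional decorators plus the named test, up to the next
# top-level statement or end of file.
patterns = [
    re.compile(rf"(?ms)^(?:@.*\n)*def {name}\b.*?(?=^\S|\Z)")
    for name in ("test_failing_param", "test_failing_plain")
]

cleaned = remove_tests_from_source(source, patterns)
assert "test_failing_plain" not in cleaned  # failing plain test removed
assert "test_failing_param" in cleaned      # parametrized test left in place
assert "test_passing" in cleaned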
Add early return guard

Return early when there are no failing tests to avoid compiling empty patterns and
mutating files unnecessarily. This prevents accidental edits and redundant work when
failing_test_names is empty.

codeflash/optimization/function_optimizer.py [1564-1587]

 def remove_failing_tests(self) -> None:
     from codeflash.code_utils.edit_generated_tests import _compile_function_patterns
 
     test_env = self.get_test_env(codeflash_loop_index=0, codeflash_test_iteration=0, codeflash_tracer_disable=1)
     behavioral_results, _ = self.run_and_parse_tests(
         testing_type=TestingMode.BEHAVIOR,
         test_env=test_env,
         test_files=self.test_files,
         optimization_iteration=0,
         testing_time=TOTAL_LOOPING_TIME_EFFECTIVE,
         enable_coverage=False,
     )
 
     failing_test_names = [
         result.id.test_function_name
         for result in behavioral_results
         if (result.test_type == TestType.GENERATED_REGRESSION and not result.did_pass)
     ]
 
     if not failing_test_names:
         logger.info("All generated tests pass ✅")
+        return
 
     function_patterns = _compile_function_patterns(failing_test_names)
Suggestion importance[1-10]: 6


Why: Early-returning when failing_test_names is empty avoids compiling and scanning files unnecessarily; it's a reasonable optimization with low risk and moderate impact.

Low
General
Avoid unnecessary file writes

Ensure you only write files when the content actually changes to avoid unnecessary
writes and potential file timestamp churn. This also reduces risk if
remove_tests_from_source returns the same content.

codeflash/optimization/function_optimizer.py [1588-1600]

 for test_file in self.test_files.test_files:
     if test_file.test_type != TestType.GENERATED_REGRESSION:
         continue
 
     for file_path in [test_file.instrumented_behavior_file_path, test_file.benchmarking_file_path]:
         if file_path and file_path.exists():
             source = file_path.read_text(encoding="utf-8")
-            source = self.remove_tests_from_source(source, function_patterns)
-            file_path.write_text(source, encoding="utf-8")
+            updated = self.remove_tests_from_source(source, function_patterns)
+            if updated != source:
+                file_path.write_text(updated, encoding="utf-8")
 
     if test_file.original_source:
-        test_file.original_source = self.remove_tests_from_source(test_file.original_source, function_patterns)
+        updated_original = self.remove_tests_from_source(test_file.original_source, function_patterns)
+        if updated_original != test_file.original_source:
+            test_file.original_source = updated_original
Suggestion importance[1-10]: 5


Why: Writing only on change reduces I/O and timestamp churn; it's accurate and improves efficiency but is not critical to correctness.

Low
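
If the write-only-on-change check ends up repeated, a tiny helper keeps it in one place; a sketch (write_if_changed is a hypothetical name, not an existing codeflash utility):

from pathlib import Path


def write_if_changed(file_path: Path, new_source: str) -> bool:
    # Write only when the content actually differs; returns True if a write happened.
    current = file_path.read_text(encoding="utf-8") if file_path.exists() else None
    if current == new_source:
        return False
    file_path.write_text(new_source, encoding="utf-8")
    return True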

Previous suggestions

Suggestions up to commit 5933149
Category | Suggestion | Impact
General
Fix mismatch in displayed metrics

perf_improvement_line uses speedup_pct and speedup_x strings, which are based on
runtime even when speedup() returns throughput-based values. Align the displayed
strings with the metric chosen by speedup() to avoid misleading output.

codeflash/result/explanation.py [32-55]

 @property
 def perf_improvement_line(self) -> str:
-    # speedup property already handles choosing between runtime and throughput
-    return f"{self.speedup_pct} improvement ({self.speedup_x} faster)."
-
-@property
-def speedup(self) -> float:
-    runtime_improvement = (self.original_runtime_ns / self.best_runtime_ns) - 1
-
-    # Use throughput improvement if we have async metrics and throughput is better
+    # Align displayed strings with chosen metric
     if (
         self.original_async_throughput is not None
         and self.best_async_throughput is not None
         and self.original_async_throughput > 0
     ):
+        runtime_improvement = (self.original_runtime_ns / self.best_runtime_ns) - 1
         throughput_improvement = throughput_gain(
-            original_throughput=self.original_async_throughput, optimized_throughput=self.best_async_throughput
+            original_throughput=self.original_async_throughput,
+            optimized_throughput=self.best_async_throughput,
         )
+        if throughput_improvement > runtime_improvement or runtime_improvement <= 0:
+            throughput_pct = f"{throughput_improvement * 100:,.0f}%"
+            throughput_x = f"{throughput_improvement + 1:,.2f}x"
+            return f"{throughput_pct} improvement ({throughput_x} faster)."
+    return f"{self.speedup_pct} improvement ({self.speedup_x} faster)."
 
-        # Use throughput metrics if throughput improvement is better or runtime got worse
-        if throughput_improvement > runtime_improvement or runtime_improvement <= 0:
-            return throughput_improvement
-
-    return runtime_improvement
-
Suggestion importance[1-10]: 8


Why: The existing code uses runtime-based strings even when throughput governs speedup; aligning perf_improvement_line with throughput when appropriate fixes misleading output and matches the PR’s throughput-focused changes.

Medium
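
As a worked example of the intended display, assuming throughput_gain returns optimized/original - 1 (which is what the '* 100' and '+ 1' formatting above implies):

# Hypothetical numbers, not measured data.
original_async_throughput = 100  # completed calls per benchmarking window
best_async_throughput = 250

throughput_improvement = best_async_throughput / original_async_throughput - 1  # 1.5
print(f"{throughput_improvement * 100:,.0f}% improvement ({throughput_improvement + 1:,.2f}x faster).")
# -> 150% improvement (2.50x faster).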
Possible issue
Refresh test registry after edits

After mutating test files in remove_failing_tests, update self.test_files to reflect
removed tests; otherwise subsequent runs may still attempt to execute now-deleted
tests, causing file-not-found or stale-source issues. Filter out removed functions
from in-memory sources/structures.

codeflash/optimization/function_optimizer.py [1621-1623]

 if failing_test_names:
     logger.info(f"Removed {len(failing_test_names)} failing generated test(s) ❌")
+    # Refresh in-memory test registry to avoid stale entries
+    self.test_files = self.test_files.__class__(
+        test_files=[
+            tf for tf in self.test_files.test_files
+            if tf.test_type != TestType.GENERATED_REGRESSION
+            or not any(pat.search(tf.original_source or "") for pat in function_patterns)
+        ]
+    )
Suggestion importance[1-10]: 6


Why: The existing lines match, and after mutating files it’s reasonable to refresh in-memory test_files to avoid stale entries; however, the proposed reconstruction of self.test_files assumes constructor details and may not fully align with the class design.

Low
Defer failing-tests removal

Running test deletion before establishing the baseline can skew baselines by
removing failing generated tests that might have surfaced issues in the original
code. Move this call to after the baseline is established to ensure metrics and
comparisons remain consistent and fair.

codeflash/optimization/function_optimizer.py [417-419]

-self.remove_failing_tests()
 console.rule()
 
+baseline_setup_result = self.setup_and_establish_baseline(
+    code_context=code_context,
+    original_helper_code=original_helper_code,
+    function_to_concolic_tests=function_to_concolic_tests,
+    ...
+)
+
+self.remove_failing_tests()
+
Suggestion importance[1-10]: 4


Why: The lines exist and moving remove_failing_tests() after establishing the baseline could prevent skewing baseline metrics, but the PR intent seems to clean flaky generated tests early and no concrete bug is shown; impact is moderate and context-dependent.

Low

@github-actions

Persistent review updated to latest commit cef9891

]

if not failing_test_names:
    logger.info("All generated tests pass ✅")
Contributor

this message is not necessary

from codeflash.code_utils.edit_generated_tests import _compile_function_patterns

test_env = self.get_test_env(codeflash_loop_index=0, codeflash_test_iteration=0, codeflash_tracer_disable=1)
behavioral_results, _ = self.run_and_parse_tests(
Contributor

we can't run tests again. we already have this info

Contributor Author

we don't have this info, I'm doing this before the behavioral and performance tests run

@KRRT7 KRRT7 closed this Nov 11, 2025
@KRRT7 KRRT7 deleted the async-improvs branch November 11, 2025 01:12
