@PaliC PaliC commented Mar 12, 2025

This PR adds pass@k analysis for correctness (+ compilation).

Usage examples:

# Add num_samples to `scripts/generate_samples.py`
python3 scripts/generate_samples.py run_name=test_hf_level_1 dataset_src=huggingface level=1 num_workers=50 server_type=deepseek model_name=deepseek-chat temperature=0 num_samples=10

# Add num_samples_per_problem and pass_at_k_values to `scripts/eval_from_generations.py`
python3 scripts/eval_from_generations.py run_name=test_hf_level_1 dataset_src=local level=1 num_gpu_devices=8 timeout=300 num_samples_per_problem=10 pass_at_k_values=[1,2,5,10]

# Run `scripts/benchmark_eval_analysis.py` normally
python3 scripts/benchmark_eval_analysis.py run_name=test_hf_level_1 level=1 hardware=L40S_matx3 baseline=baseline_time_torch

Note: for all analysis other than pass@k correctness, only the first sample per problem is used.
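To compute pass@k, the evaluation needs, for each problem, how many samples were run and how many passed. A hypothetical sketch of that tally (the `sample_results` shape and names here are illustrative, not the PR's actual data structures):

```python
from collections import Counter

# Hypothetical per-sample results: (problem_id, is_correct).
sample_results = [
    (1, True), (1, False), (1, True),
    (2, False), (2, False), (2, False),
]

correct = Counter(pid for pid, ok in sample_results if ok)
total = Counter(pid for pid, _ in sample_results)

# Per-problem (n_samples, n_correct) pairs feeding the pass@k computation.
counts = {pid: (total[pid], correct.get(pid, 0)) for pid in total}
```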

The output of scripts/eval_from_generations.py includes:

Evaluation metadata: {'total_problems': 100, 'problems_with_samples': 100, 'total_evaluated_samples': 994, 'total_correct_samples': 129, 'pass@1_count': 8, 'pass@2_count': 10, 'pass@5_count': 13, 'pass@10_count': 16}
Average pass@k Correctness metrics: {'avg_pass@1': 0.13114285714285714, 'avg_pass@2': 0.14752380952380953, 'avg_pass@5': 0.16190476190476188, 'avg_pass@10': 0.16}
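The fractional `avg_pass@k` values are consistent with the standard unbiased pass@k estimator (popularized by the HumanEval/Codex evaluation) averaged over problems. Whether this PR uses exactly this formula is an assumption; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n total, c of them correct, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws; a correct sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Average across problems, given (n_samples, n_correct) per problem.
per_problem = [(10, 3), (10, 0), (10, 1)]
avg_pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem)
```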

Or in the output of scripts/benchmark_eval_analysis.py as:

Pass@k Correctness Metrics:

Evaluation Metadata:
+-------------------------+---------+
| Metric                  |   Value |
+=========================+=========+
| total_problems          |     100 |
+-------------------------+---------+
| problems_with_samples   |     100 |
+-------------------------+---------+
| total_evaluated_samples |     994 |
+-------------------------+---------+
| total_correct_samples   |     129 |
+-------------------------+---------+
| pass@1_count            |     100 |
+-------------------------+---------+
| pass@2_count            |     100 |
+-------------------------+---------+
| pass@5_count            |     100 |
+-------------------------+---------+
| pass@10_count           |      98 |
+-------------------------+---------+

Average Pass@k Metrics:
+-------------+----------+
| Metric      |    Value |
+=============+==========+
| avg_pass@1  | 0.131143 |
+-------------+----------+
| avg_pass@2  | 0.147524 |
+-------------+----------+
| avg_pass@5  | 0.161905 |
+-------------+----------+
| avg_pass@10 | 0.16     |
+-------------+----------+
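Tables in this format are typically rendered with a library such as tabulate (`tablefmt="grid"`); that the PR uses it is an assumption. For illustration, a stdlib-only approximation (`grid_table` is a hypothetical helper, not from the PR):

```python
def grid_table(headers, rows):
    """Minimal plain-text grid renderer mimicking tabulate's 'grid' format."""
    str_rows = [[str(c) for c in r] for r in rows]
    widths = [max(len(row[i]) for row in [headers] + str_rows)
              for i in range(len(headers))]
    def border(ch):
        return "+" + "+".join(ch * (w + 2) for w in widths) + "+"
    def line(cells):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |"
    out = [border("-"), line(headers), border("=")]
    for r in str_rows:
        out += [line(r), border("-")]
    return "\n".join(out)

print(grid_table(["Metric", "Value"],
                 [["total_problems", 100], ["avg_pass@1", 0.131143]]))
```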

Also enjoy the copilot summary below :)

This pull request includes significant updates to the scripts/benchmark_eval_analysis.py and scripts/eval_from_generations.py files to enhance the evaluation and analysis processes. Key changes include reorganization of imports, addition of pass@k analysis, and improvements to error handling and formatting.

Enhancements to scripts/benchmark_eval_analysis.py:

  • Reorganized imports for better readability and consistency.
  • Added pass@k analysis to the analyze_greedy_eval function, including checks for pass@k results and display of the metrics when available.
  • Improved formatting and readability by breaking long lines and adding necessary spacing.

Enhancements to scripts/eval_from_generations.py:

  • Reorganized imports and added new imports for multiprocessing, collections, and numpy.
  • Added configuration options for pass@k analysis, including the number of samples per problem and list of k values.
  • Improved error handling and formatting in several functions, including fetch_ref_arch_from_problem_id, fetch_kernel_from_disk, and evaluate_single_sample.
  • Added detailed print statements for easier debugging and tracking of evaluation results.

@PaliC PaliC requested a review from simonguozirui March 13, 2025 17:01
@simonguozirui simonguozirui mentioned this pull request Mar 25, 2025