@PaliC PaliC commented Mar 12, 2025

This PR adds pass@k analysis for correctness (+ compilation).

Usage examples:

# Add num_samples to `scripts/generate_samples.py`
python3 scripts/generate_samples.py run_name=test_hf_level_1 dataset_src=huggingface level=1 num_workers=50 server_type=deepseek model_name=deepseek-chat temperature=0 num_samples=10

# Add num_samples_per_problem and pass_at_k_values to `scripts/eval_from_generations.py`
python3 scripts/eval_from_generations.py run_name=test_hf_level_1 dataset_src=local level=1 num_gpu_devices=8 timeout=300 num_samples_per_problem=10 pass_at_k_values=[1,2,5,10]

# Run `scripts/benchmark_eval_analysis.py` normally
python3 scripts/benchmark_eval_analysis.py run_name=test_hf_level_1 level=1 hardware=L40S_matx3 baseline=baseline_time_torch

Note: for all analysis other than pass@k correctness, only the first sample per problem is used.
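To compute pass@k, the evaluation needs, for each problem, how many samples were run and how many passed. A hypothetical sketch of that tally (the `sample_results` shape and names here are illustrative, not the PR's actual data structures):

```python
from collections import Counter

# Hypothetical per-sample results: (problem_id, is_correct).
sample_results = [
    (1, True), (1, False), (1, True),
    (2, False), (2, False), (2, False),
]

correct = Counter(pid for pid, ok in sample_results if ok)
total = Counter(pid for pid, _ in sample_results)

# Per-problem (n_samples, n_correct) pairs feeding the pass@k computation.
counts = {pid: (total[pid], correct.get(pid, 0)) for pid in total}
```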

The output of scripts/eval_from_generations.py includes:

Evaluation metadata: {'total_problems': 100, 'problems_with_samples': 100, 'total_evaluated_samples': 994, 'total_correct_samples': 129, 'pass@1_count': 8, 'pass@2_count': 10, 'pass@5_count': 13, 'pass@10_count': 16}
Average pass@k Correctness metrics: {'avg_pass@1': 0.13114285714285714, 'avg_pass@2': 0.14752380952380953, 'avg_pass@5': 0.16190476190476188, 'avg_pass@10': 0.16}
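The fractional `avg_pass@k` values are consistent with the standard unbiased pass@k estimator (popularized by the HumanEval/Codex evaluation) averaged over problems. Whether this PR uses exactly this formula is an assumption; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n total, c of them correct, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws; a correct sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Average across problems, given (n_samples, n_correct) per problem.
per_problem = [(10, 3), (10, 0), (10, 1)]
avg_pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem)
```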

Or in the output of scripts/benchmark_eval_analysis.py as:

Pass@k Correctness Metrics:

Evaluation Metadata:
+-------------------------+---------+
| Metric                  |   Value |
+=========================+=========+
| total_problems          |     100 |
+-------------------------+---------+
| problems_with_samples   |     100 |
+-------------------------+---------+
| total_evaluated_samples |     994 |
+-------------------------+---------+
| total_correct_samples   |     129 |
+-------------------------+---------+
| pass@1_count            |     100 |
+-------------------------+---------+
| pass@2_count            |     100 |
+-------------------------+---------+
| pass@5_count            |     100 |
+-------------------------+---------+
| pass@10_count           |      98 |
+-------------------------+---------+

Average Pass@k Metrics:
+-------------+----------+
| Metric      |    Value |
+=============+==========+
| avg_pass@1  | 0.131143 |
+-------------+----------+
| avg_pass@2  | 0.147524 |
+-------------+----------+
| avg_pass@5  | 0.161905 |
+-------------+----------+
| avg_pass@10 | 0.16     |
+-------------+----------+
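Tables in this format are typically rendered with a library such as tabulate (`tablefmt="grid"`); that the PR uses it is an assumption. For illustration, a stdlib-only approximation (`grid_table` is a hypothetical helper, not from the PR):

```python
def grid_table(headers, rows):
    """Minimal plain-text grid renderer mimicking tabulate's 'grid' format."""
    str_rows = [[str(c) for c in r] for r in rows]
    widths = [max(len(row[i]) for row in [headers] + str_rows)
              for i in range(len(headers))]
    def border(ch):
        return "+" + "+".join(ch * (w + 2) for w in widths) + "+"
    def line(cells):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |"
    out = [border("-"), line(headers), border("=")]
    for r in str_rows:
        out += [line(r), border("-")]
    return "\n".join(out)

print(grid_table(["Metric", "Value"],
                 [["total_problems", 100], ["avg_pass@1", 0.131143]]))
```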

Also enjoy the copilot summary below :)

This pull request includes significant updates to the scripts/benchmark_eval_analysis.py and scripts/eval_from_generations.py files to enhance the evaluation and analysis processes. Key changes include reorganization of imports, addition of pass@k analysis, and improvements to error handling and formatting.

Enhancements to scripts/benchmark_eval_analysis.py:

  • Reorganized imports for better readability and consistency.
  • Added pass@k analysis to the analyze_greedy_eval function, including checks for pass@k results and display of the metrics when available.
  • Improved formatting and readability by breaking long lines and adding necessary spacing.

Enhancements to scripts/eval_from_generations.py:

  • Reorganized imports and added new imports for multiprocessing, collections, and numpy.
  • Added configuration options for pass@k analysis, including the number of samples per problem and list of k values.
  • Improved error handling and formatting in several functions, including fetch_ref_arch_from_problem_id, fetch_kernel_from_disk, and evaluate_single_sample.
  • Added detailed print statements for easier debugging and tracking of evaluation results.

@PaliC PaliC requested a review from simonguozirui March 13, 2025 17:01
@simonguozirui simonguozirui mentioned this pull request Mar 25, 2025