Benchmark scripts #2525
Conversation
Thanks a lot for the PR. I didn't have time to look into the details yet, hopefully will do that on Friday. In the meantime, could you please delete all the results (
Thanks a lot for your hard work on this topic. I think this PR brings us a lot closer to a feature with usability similar to the MetaMathQA training suite. There are still some bigger and smaller areas for improvement, but I'm sure we're gonna get there.
Regarding the result log, it is currently just a flat json. I would like to see more structure there. In MetaMathQA, we have a json with 3 keys: run_info, train_info, and meta_info. Let's try to structure the results here similarly. Especially the meta info is currently completely absent. Let's add something similar to what we have in MetaMathQA.
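For illustration, a minimal, hypothetical sketch of what such a structured result file could look like for this benchmark (only the three top-level keys mirror MetaMathQA; the inner fields are placeholders):

```python
# A minimal, hypothetical sketch of a structured result file for this benchmark,
# mirroring the run_info / train_info / meta_info split used in MetaMathQA.
result = {
    "run_info": {
        "experiment_name": "lora_r16",        # placeholder values
        "created_at": "2025-01-01T00:00:00",
    },
    "train_info": {                           # or "generation_info" for this inference-only benchmark
        "avg_inference_time": 0.0,
        "avg_generated_tokens": 0,
    },
    "meta_info": {
        "package_info": {"peft-version": "..."},
        "system_info": {"gpu": "..."},
    },
}
```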
We're working with untrained adapters. This should generally be fine, as most adapters like LoRA don't influence the output when not trained, so the model generations should be identical as for the base model. There are some PEFT methods that cannot be zero-initialized, however, which means for these methods the generations will look different. I think we can mitigate this by tracking the generation time per token, so that longer generations are not automatically penalized.
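A small sketch of what tracking generation time per token could look like (assuming a standard transformers generate() call; this is not the PR's actual helper):

```python
import time

import torch


def timed_generate(model, inputs, **gen_kwargs):
    """Return seconds per newly generated token for a single generate() call."""
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(**inputs, **gen_kwargs)
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    # Normalizing by token count avoids penalizing methods that happen to generate longer outputs.
    return elapsed / max(new_tokens, 1)
```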
One thing that would be good to improve is the parametrization of the PEFT configs. I don't have a good proposal for how, but let me explain what I mean: right now, there is a LoRA config with rank 8 and another one with rank 16, and the rest is identical. If we want to add more ranks, we need to create a copy each time. And what if we want to parametrize another setting? The number of configs would grow combinatorially. Ideally, we would only have a single LoRA config with the fixed parameters and then another way to define the changing parameters. Do you have some ideas?
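One possible shape for this, purely as a hedged sketch (the base_config.json file name and the override dicts are hypothetical, not part of the PR): keep a single base config per method and express each experiment as a small set of overrides.

```python
import json
from copy import deepcopy


def build_config(base_path, overrides):
    """Load a shared base config and apply experiment-specific overrides."""
    with open(base_path) as f:
        base = json.load(f)
    merged = deepcopy(base)
    merged.update(overrides)
    return merged


# One base LoRA config plus small override dicts instead of a full copy per rank.
lora_r8 = build_config("experiments/lora/base_config.json", {"r": 8})
lora_r16 = build_config("experiments/lora/base_config.json", {"r": 16, "lora_alpha": 32})
```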
Also, please add a README.md and run make style before pushing your changes.
Tried to cover all the changes, please have a look :)
Thanks a lot for the updates, we're moving in the right direction. Unfortunately, due to some issues that I commented on, I could not run the script successfully. Could you please check and update the PR? Also, some of my previous comments are still unaddressed.
        
          
(Comment on method_comparison/peft_bench/experiments/lora/lora_r16/benchmark_params.json, now outdated and resolved.)
I am not sure why it's not running on your side; the rest we can mostly fix. Can you please guide me with that?
Force-pushed ea4bb46 to 627a038.
There are some import issues, so we can't run it directly.
          
Can you be a bit more specific on what import errors you're experiencing?
@githubnemo These imports fail when running the script directly:

```python
from method_comparison.peft_bench.data import prepare_benchmark_prompts
from method_comparison.peft_bench.utils import ...
```

As a workaround, I add the script directory to sys.path:

```python
import os
import sys

sys.path.append(os.path.dirname(os.path.abspath(__file__)))
```

and run it like this:

```bash
cd /home/ved/code/git/peft && DISABLE_FLASH_ATTN=1 PYTHONPATH=. python3 method_comparison/peft_bench/run.py method_comparison/peft_bench/experiments/lora/lora_r16 --verbose
```

Hey @BenjaminBossan, can you please have a look and give me the final batch of changes? Willing to finish this ASAP :)
Thanks for pushing this.
I'm taking over the review since @BenjaminBossan is currently OoO.
Regarding the import path issues (and the experiment path resolving) I'd suggest limiting the execution of experiments to the method_comparison/peft_bench folder - every other folder is unsupported, making the code a bit simpler, I think.
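To make that concrete, a minimal sketch of such a restriction, assuming the validate_experiment_path helper mentioned later in this thread (the body here is illustrative, not the PR's actual code):

```python
from pathlib import Path

# Assumed location of the benchmark folder; adjust if the layout differs.
BENCH_ROOT = Path("method_comparison/peft_bench").resolve()


def validate_experiment_path(path):
    """Only accept experiment directories that live inside the benchmark folder."""
    resolved = Path(path).resolve()
    if not resolved.is_relative_to(BENCH_ROOT):
        raise ValueError(f"Experiments must live under {BENCH_ROOT}, got {resolved}")
    return resolved
```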
I think there are still some comments from @BenjaminBossan that are unresolved regarding number of iterations in the LoRA experiment configs and regarding the default config behavior - I've added some comments of my own on top.
        
          
(Comment on method_comparison/peft_bench/experiments/lora/lora_r16/adapter_config.json, now outdated and resolved.)
Did all the required changes from above (from you); we can resolve all conversations. Please go through it and let me know.
I know I have not covered all of the reviews due to their large number, and now all of them are mixed up, solved/unsolved.
Thank you so much! You covered a lot already but I get it, there are a lot of comments. I went over all comments and marked quite a few old ones from previous reviews as resolved - those that are still open should have a comment stating why they are still open or are discussed in this comment.
Here's an attempt to list the major things that are still open (the unresolved comments are still valid, as far as I can see, but let's focus on the big things first):
prompt categories: As Benjamin already commented, it is probably best to run the benchmark across all prompt categories but track the metrics separately, so that the results are always comparable (i.e. it is the same task for each experiment but the model parameters differ). See this comment. This ties in to how we will manage configs, I think.
config arrangement: Using a default config for each experiment and just configuring the difference is still unresolved, at least partially. See this comment for the general remark. Let me make a suggestion and let's discuss the best way of handling configs. MetaMathQA has

- (1) an adapter config and
- (2) optionally training params.

The equivalent for the PEFT bench task would be (1) the adapter config and (2) the benchmark config. The adapter config should be different for each experiment. The benchmark config should only differ if there's something that really needs changing, like the number of inference runs because the measurement error is still too high. But ideally only the adapter config changes between experiments so that every experiment outcome is comparable. This means that there would be one shared default config for the benchmark setup and one adapter config per experiment. This would also remove the need for the "peft_config_variant" field, and there would be a clear separation between benchmark setup and adapter config. WDYT?
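To make the suggestion concrete, a minimal sketch of how a runner could combine the two (default_benchmark_params.json and adapter_config.json are the file names discussed in this thread; the helper itself is illustrative):

```python
import json
from pathlib import Path


def load_experiment(experiment_dir):
    """Combine the shared benchmark defaults with the experiment-specific adapter config."""
    bench_root = Path("method_comparison/peft_bench")
    benchmark_config = json.loads((bench_root / "default_benchmark_params.json").read_text())
    adapter_config = json.loads((Path(experiment_dir) / "adapter_config.json").read_text())
    return benchmark_config, adapter_config
```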
I have covered all the changes mentioned in the recent replies. Also, about the configs, we can take your approach; just verify these changes and we can work on the other part.
Yes, this looks good, thank you!
Individual results
Regarding the results, I was wondering about storing the individual results as well, alongside the averaged values. For example results["generation_config"]["by_category"]["short"]["samples"][0] = {"inference_time": .., "generated_tokens": ..., etc.}. One of the scenarios I have in mind where this might be beneficial is that you can get an average of 20 generated tokens in a lot of ways, i.e. (35 + 5)/2 = (20 + 20)/2 = (30 + 10)/2 = ... = 20. You could then falsely claim that method X is slower on average, but in reality there may have been a bug which generated one very long sequence and very short ones after (and this is important since runtime is not necessarily linear, especially for transformers). If we store the individual results as well, then we could easily spot that the sequence length distribution is off.
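To illustrate, a hypothetical sketch of such a result entry (key names follow the example above, the numbers are placeholders):

```python
# Hypothetical shape for storing per-sample entries next to the averaged values.
results = {
    "generation_config": {
        "by_category": {
            "short": {
                "avg_inference_time": 0.21,   # (0.35 + 0.07) / 2, placeholder numbers
                "avg_generated_tokens": 20,   # (35 + 5) / 2
                "samples": [
                    {"inference_time": 0.35, "generated_tokens": 35},
                    {"inference_time": 0.07, "generated_tokens": 5},
                ],
            }
        }
    }
}
```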
Meta Info
I was comparing the result files of the peft_bench and MetaMathQA and I found the meta_info to be a lot richer. From the README:
"meta_info": {
  "model_sha": "13afe5124825b4f3751f836b40dafda64c1ed062",
  "model_created_at": "2024-09-18T15:23:48+00:00",
  "dataset_sha": "aa4f34d3d2d3231299b5b03d9b3e5a20da45aa18",
  "dataset_created_at": "2023-09-21T17:22:46+00:00",
  "package_info": {
    "transformers-version": "4.50.0.dev0",
    "transformers-commit-hash": "752ef3fd4e70869626ec70657a770a85c0ad9219",
    "peft-version": "0.14.1.dev0",
    "peft-commit-hash": "a447a4e5ecd87b7d57733f4df9616a328cf130f4",
    "datasets-version": "3.3.2",
    "datasets-commit-hash": null,
    "bitsandbytes-version": "0.45.2",
    "bitsandbytes-commit-hash": null,
    "torch-version": "2.6.0+cu124",
    "torch-commit-hash": null
  },
  "system_info": {
    "system": "Linux",
    "release": "6.11.0-17-generic",
    "version": "#17~24.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jan 20 22:48:29 UTC 2",
    "machine": "x86_64",
    "processor": "x86_64",
    "gpu": "NVIDIA GeForce RTX 4090"
  },
  "pytorch_info": "PyTorch built with: [...]"
}
I think it would make sense to include this in this benchmark as well. It would have the benefit of having all important system info in one place which makes it easier to pinpoint differences because of the environment. We could re-use the code from there. The info that currently resides in meta_info is more akin to run_info IMO. WDYT?
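A minimal sketch of how that meta info could be collected, using only the standard library and torch (MetaMathQA's actual helper may differ):

```python
import platform
from importlib.metadata import PackageNotFoundError, version

import torch


def collect_meta_info():
    """Gather package versions and system info, similar to MetaMathQA's meta_info."""
    packages = {}
    for name in ["transformers", "peft", "datasets", "bitsandbytes", "torch"]:
        try:
            packages[f"{name}-version"] = version(name)
        except PackageNotFoundError:
            packages[f"{name}-version"] = None
    return {
        "package_info": packages,
        "system_info": {
            "system": platform.system(),
            "release": platform.release(),
            "machine": platform.machine(),
            "processor": platform.processor(),
            "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
        },
        "pytorch_info": torch.__config__.show(),
    }
```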
Config inclusion
I think that both the PEFT and the benchmark config should be included in the results so that the result is a representation of the benchmark state (doesn't cost much and is way easier to handle). Something like results["run_info"]["peft_config"] = <peft adapter config> and results["run_info"]["benchmark_config"] = <benchmark config>.
> - **Trainable Parameters**: Number of parameters being fine-tuned
> - **Parameter Ratio**: Percentage of total parameters being trained
There's no training, what do you mean by training here?
I don't remember; removed.
> {
>   "model_id": "facebook/opt-350m",
I think it makes sense, model-wise, to use a more modern and popular model like Llama-3.2-3B instead of OPT.
| "num_inference_runs": 10, | ||
| "train_batch_size": 2, | ||
| "train_steps": 3, | ||
| "num_prompt_samples": 1 | 
I think we decided to use all prompts instead of sampling / using a subset, can we remove the num_prompt_samples parameter?
| "train_batch_size": 2, | ||
| "train_steps": 3, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| "train_batch_size": 2, | |
| "train_steps": 3, | 
Can be removed?
…, inputs out of loop
Thanks for the detailed review and keeping up with the PR <3
Thanks for the updates. I still discovered some issues, most notably I couldn't run the run_base.py script successfully, could you please check?
> about the first point NAME i am not sure, it could be anything like peft inference, fast peft, inference bench, as you guys wish

How about "Text Generation Benchmark" (directory: text_generation_benchmark)?

> added config for max new tokens depending on the prompt length

The differences are now noticeable.

About 3. and 4., okay, let's work on that later.
> base_model_name_or_path="facebook/opt-350m",
> bias="none",
> fan_in_fan_out=False,
> inference_mode=False,
> init_lora_weights=True,
> lora_alpha=16,
> lora_dropout=0.1,
> modules_to_save=None,
> peft_type="LORA",
Let's remove those arguments, they don't really add anything for the purpose of this benchmark.
| print_fn(f" {metric_name.replace('_', ' ').title()}: {value}") | ||
| overall = benchmark_result.generation_info.get("overall") | ||
| if overall: | ||
| # Use the same formatting as in Detailed Metrics | 
Remove the comment
        
          
In method_comparison/peft_bench/run.py (outdated):

> print_fn("Computing base model inference times...")
>
> # Prepare benchmark prompts
> # Prepare benchmark prompts
Remove
> """
> Script to measure base model inference times and save results for reuse.
> This should be run once per model configuration to avoid redundant computations.
Let's add this info also to the README
        
          
In method_comparison/peft_bench/run.py (outdated):

> print_fn("Using cached base model results...")
> base_inference_times = base_results["inference_results"]
> else:
> print_fn("No cached base results found. Please run run_base.py first.")
I think we can just raise an error here and require the base results to be computed first. Otherwise, there is too much risk of missing this message and then, when running dozens of script, we re-compute the base results dozens of times.
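For example, a sketch of failing fast instead of printing (the base_results.json file name is an assumption):

```python
import json
from pathlib import Path


def load_base_results(results_dir):
    """Fail fast instead of silently recomputing when the base results are missing."""
    base_results_path = Path(results_dir) / "base_results.json"  # hypothetical file name
    if not base_results_path.exists():
        raise FileNotFoundError(
            f"No cached base results found at {base_results_path}. Please run run_base.py first."
        )
    return json.loads(base_results_path.read_text())
```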
> import traceback
>
> traceback.print_exc()
I don't think we need this, just remove the whole try ... except to get the same result.
> def main():
>     """Main entry point for the base model benchmark runner."""
>     parser = argparse.ArgumentParser(description="Run base model benchmarks")
>     parser.add_argument("experiment_path", help="Path to experiment directory (to load config)")
I don't understand why we have to pass the experiment path here, should this not always use the default benchmark settings?
        
          
In method_comparison/peft_bench/run.py (outdated):

> inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
>
> # Get category-specific max_new_tokens
> cat_max_new_tokens = (category_generation_params.get(category, {}).get("max_new_tokens", max_new_tokens))
When category_generation_params is None, this line fails.
How could this be None?
Thanks for the updates, but I still have trouble creating the base benchmark results. I think if you follow my comments, the problem should be fixed. Ideally, before you push your final changes, you should test locally if the scripts pass successfully. Also, please run ruff on this directory.
> How could this be None?

In run_base.py, category_generation_params is not passed, so the default value of None is used. Try running python run_base.py and you should encounter the error. Let's remove the default value for category_generation_params and instead require the dict to always be passed.
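A sketch of the suggested signature change (a simplified stand-in for the PR's function, not its exact code):

```python
# Simplified stand-in for the generation loop; category_generation_params is now required.
def generate_outputs(model, tokenizer, prompts_by_category, category_generation_params):
    for category, prompts in prompts_by_category.items():
        cat_params = category_generation_params.get(category, {})
        cat_max_new_tokens = cat_params.get("max_new_tokens", 50)  # 50 is a placeholder fallback
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            model.generate(**inputs, max_new_tokens=cat_max_new_tokens)
```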
> **Usage:**
> ```bash
> python run_base.py experiments/lora/lora_r8
> ```
Calling the script like this is not possible, as the argument parser does not recognize the argument. Also, the base model results should not depend on any specific experiment. Instead, it should read the default_benchmark_params.json.
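A minimal sketch of what run_base.py's entry point could look like under this suggestion (argument handling and paths are assumptions based on this thread):

```python
import argparse
import json
from pathlib import Path


def main():
    parser = argparse.ArgumentParser(description="Run base model benchmarks")
    parser.add_argument("--verbose", action="store_true")
    args = parser.parse_args()

    # The base results depend only on the shared defaults, not on any specific experiment.
    defaults_path = Path(__file__).parent / "default_benchmark_params.json"
    benchmark_params = json.loads(defaults_path.read_text())
    ...


if __name__ == "__main__":
    main()
```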
Thanks for the review, covered all the changes. Will keep this in mind 🫡
Thanks for the updates, from my point of view, not much is missing. I had one more comment on the code, apart from that:
- Please create the required empty result directories (`results/` etc.) and place a `.gitkeep` inside, then add them.
- Just a suggestion: how about running prompts in reverse order, i.e. starting with the longest first? That way, the setting most likely to OOM runs first, wasting less of the user's time in case it happens. (A sketch follows below.)
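A sketch of that ordering (the grouping of prompts by category is assumed; names are illustrative):

```python
# Run the longest prompts first so a potential OOM shows up as early as possible.
def order_longest_first(prompts_by_category):
    flat = [
        (category, prompt)
        for category, prompts in prompts_by_category.items()
        for prompt in prompts
    ]
    return sorted(flat, key=lambda item: len(item[1]), reverse=True)
```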
 
> # Configure print function based on verbosity
> print_fn = print if args.verbose else lambda *args, **kwargs: None
>
> default_experiment_path = "experiments/lora/lora_r8"
Let's remove this argument. For the base model, I don't think it makes sense to call validate_experiment_path at all.
          
Can we do this in a separate PR?
          
Okay, that works. Could you please create the empty folders for
Thank you for adding the text generation benchmark to the PEFT method comparison suite and for your patience with the review. I think this is now in a good enough state that it can be merged.
Just so that we don't forget, there are still a few todos for this task, listed below:
- Add config files for all PEFT methods (can be mostly based on the existing ones for MetaMathQA, using the default settings).
- Running the prompt groups in reverse order, i.e. longest first, to hit potential OOMs as early as possible.
- Make the code device agnostic, like in #2610 (see the sketch below).
- Add a Makefile similar to what we have in MetaMathQA.
- Update the `method_comparison/app.py` to include the results from this benchmark.
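For the device-agnostic item, a generic hedged sketch of the usual pattern (not necessarily what #2610 does):

```python
import torch


def get_default_device():
    """Pick the best available accelerator instead of hard-coding CUDA."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```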
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Note: The task of making the benchmark more device agnostic has already been taken care of in #2730.
Scripts PR
Docs PR
This is the script format you asked for, similar to the MetaMathQA directory.
Please have a look and let me know the changes in detail; we have the numbers and all.
Will need to add more examples though, let me know what you think.