Benchmark scripts #2525
Conversation
Thanks a lot for the PR. I didn't have time to look into the details yet, hopefully will do that on Friday. In the meantime, could you please delete all the results (
Thanks a lot for your hard work on this topic. I think this PR brings us a lot closer to a feature with usability similar to the MetaMathQA training suite. There are still some bigger and smaller areas for improvement, but I'm sure we're gonna get there.
Regarding the result log, it is currently just a flat json. I would like to see more structure there. In MetaMathQA, we have a json with 3 keys: run_info, train_info, and meta_info. Let's try to structure the results here similarly. Especially the meta info is currently completely absent. Let's add something similar to what we have in MetaMathQA.
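For illustration, a minimal, hypothetical sketch of what such a structured result file could look like for this benchmark (only the three top-level keys mirror MetaMathQA; the inner fields are placeholders):

```python
# A minimal, hypothetical sketch of a structured result file for this benchmark,
# mirroring the run_info / train_info / meta_info split used in MetaMathQA.
result = {
    "run_info": {
        "experiment_name": "lora_r16",        # placeholder values
        "created_at": "2025-01-01T00:00:00",
    },
    "train_info": {                           # or "generation_info" for this inference-only benchmark
        "avg_inference_time": 0.0,
        "avg_generated_tokens": 0,
    },
    "meta_info": {
        "package_info": {"peft-version": "..."},
        "system_info": {"gpu": "..."},
    },
}
```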
We're working with untrained adapters. This should generally be fine, as most adapters like LoRA don't influence the output when not trained, so the model generations should be identical as for the base model. There are some PEFT methods that cannot be zero-initialized, however, which means for these methods the generations will look different. I think we can mitigate this by tracking the generation time per token, so that longer generations are not automatically penalized.
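A small sketch of what tracking generation time per token could look like (assuming a standard transformers generate() call; this is not the PR's actual helper):

```python
import time

import torch


def timed_generate(model, inputs, **gen_kwargs):
    """Return seconds per newly generated token for a single generate() call."""
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(**inputs, **gen_kwargs)
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    # Normalizing by token count avoids penalizing methods that happen to generate longer outputs.
    return elapsed / max(new_tokens, 1)
```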
One thing that would be good to improve is the parametrization of the PEFT configs. I don't have a good proposal for how, but let me explain what I mean: right now, there is a LoRA config with rank 8 and another one with rank 16, and the rest is identical. If we want to add more ranks, we need to create a copy each time. And what if we want to parametrize another setting? The number of configs would grow combinatorially. Ideally, we would only have a single LoRA config with the fixed parameters and then another way to define the changing parameters. Do you have some ideas?
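One possible shape for this, purely as a hedged sketch (the base_config.json file name and the override dicts are hypothetical, not part of the PR): keep a single base config per method and express each experiment as a small set of overrides.

```python
import json
from copy import deepcopy


def build_config(base_path, overrides):
    """Load a shared base config and apply experiment-specific overrides."""
    with open(base_path) as f:
        base = json.load(f)
    merged = deepcopy(base)
    merged.update(overrides)
    return merged


# One base LoRA config plus small override dicts instead of a full copy per rank.
lora_r8 = build_config("experiments/lora/base_config.json", {"r": 8})
lora_r16 = build_config("experiments/lora/base_config.json", {"r": 16, "lora_alpha": 32})
```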
Also, please add a README.md and run make style before pushing your changes.
Tried to cover all the changes, please have a look :)
Thanks a lot for the updates, we're moving in the right direction. Unfortunately, due to some issues that I commented on, I could not run the script successfully. Could you please check and update the PR? Also, some of my previous comments are still unaddressed.
        
          
(Comment on method_comparison/peft_bench/experiments/lora/lora_r16/benchmark_params.json, now outdated and resolved.)
I am not sure why it's not running on your side; the rest we can mostly fix. Can you please guide me with that?
Force-pushed ea4bb46 to 627a038.
There are some import issues, so we can't run it directly.
          
Can you be a bit more specific on what import errors you're experiencing?
@githubnemo These imports fail when running the script directly:

```python
from method_comparison.peft_bench.data import prepare_benchmark_prompts
from method_comparison.peft_bench.utils import ...
```

As a workaround, I add the script directory to sys.path:

```python
import os
import sys

sys.path.append(os.path.dirname(os.path.abspath(__file__)))
```

and run it like this:

```bash
cd /home/ved/code/git/peft && DISABLE_FLASH_ATTN=1 PYTHONPATH=. python3 method_comparison/peft_bench/run.py method_comparison/peft_bench/experiments/lora/lora_r16 --verbose
```

Hey @BenjaminBossan, can you please have a look and give me the final batch of changes? Willing to finish this ASAP :)
Thanks for pushing this.
I'm taking over the review since @BenjaminBossan is currently OoO.
Regarding the import path issues (and the experiment path resolving) I'd suggest limiting the execution of experiments to the method_comparison/peft_bench folder - every other folder is unsupported, making the code a bit simpler, I think.
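To make that concrete, a minimal sketch of such a restriction, assuming the validate_experiment_path helper mentioned later in this thread (the body here is illustrative, not the PR's actual code):

```python
from pathlib import Path

# Assumed location of the benchmark folder; adjust if the layout differs.
BENCH_ROOT = Path("method_comparison/peft_bench").resolve()


def validate_experiment_path(path):
    """Only accept experiment directories that live inside the benchmark folder."""
    resolved = Path(path).resolve()
    if not resolved.is_relative_to(BENCH_ROOT):
        raise ValueError(f"Experiments must live under {BENCH_ROOT}, got {resolved}")
    return resolved
```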
I think there are still some comments from @BenjaminBossan that are unresolved regarding number of iterations in the LoRA experiment configs and regarding the default config behavior - I've added some comments of my own on top.
        
          
(Comment on method_comparison/peft_bench/experiments/lora/lora_r16/adapter_config.json, now outdated and resolved.)
Did all the required changes from above (from you); we can resolve all conversations. Please go through it and let me know.
I know I have not covered all of the reviews due to their large number, and now all of them are mixed up, solved/unsolved.
Thank you so much! You covered a lot already but I get it, there are a lot of comments. I went over all comments and marked quite a few old ones from previous reviews as resolved - those that are still open should have a comment stating why they are still open or are discussed in this comment.
Here's an attempt to list the major things that are still open (the unresolved comments are still valid, as far as I can see, but let's focus on the big things first):
prompt categories: As Benjamin already commented, it is probably best to run the benchmark across all prompt categories but track the metrics separately, so that the results are always comparable (i.e. it is the same task for each experiment but the model parameters differ). See this comment. This ties in to how we will manage configs, I think.
config arrangement: Using a default config for each experiment and just configuring the difference is still unresolved, at least partially. See this comment for the general remark. Let me make a suggestion and let's discuss the best way of handling configs. MetaMathQA has

- (1) an adapter config and
- (2) optionally training params.

The equivalent for the PEFT bench task would be (1) the adapter config and (2) the benchmark config. The adapter config should be different for each experiment. The benchmark config should only differ if there's something that really needs changing, like the number of inference runs because the measurement error is still too high. But ideally only the adapter config changes between experiments so that every experiment outcome is comparable. This means that there would be one shared default config for the benchmark setup and one adapter config per experiment. This would also remove the need for the "peft_config_variant" field, and there would be a clear separation between benchmark setup and adapter config. WDYT?
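To make the suggestion concrete, a minimal sketch of how a runner could combine the two (default_benchmark_params.json and adapter_config.json are the file names discussed in this thread; the helper itself is illustrative):

```python
import json
from pathlib import Path


def load_experiment(experiment_dir):
    """Combine the shared benchmark defaults with the experiment-specific adapter config."""
    bench_root = Path("method_comparison/peft_bench")
    benchmark_config = json.loads((bench_root / "default_benchmark_params.json").read_text())
    adapter_config = json.loads((Path(experiment_dir) / "adapter_config.json").read_text())
    return benchmark_config, adapter_config
```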
I have covered all the changes mentioned in the recent replies. Also, about the configs, we can take your approach; just verify these changes and we can work on the other part.
Yes, this looks good, thank you!
Individual results
Regarding the results, I was wondering about storing the individual results as well, alongside the averaged values. For example results["generation_config"]["by_category"]["short"]["samples"][0] = {"inference_time": .., "generated_tokens": ..., etc.}. One of the scenarios I have in mind where this might be beneficial is that you can get an average of 20 generated tokens in a lot of ways, i.e. (35 + 5)/2 = (20 + 20)/2 = (30 + 10)/2 = ... = 20. You could then falsely claim that method X is slower on average, but in reality there may have been a bug which generated one very long sequence and very short ones after (and this is important since runtime is not necessarily linear, especially for transformers). If we store the individual results as well, then we could easily spot that the sequence length distribution is off.
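To illustrate, a hypothetical sketch of such a result entry (key names follow the example above, the numbers are placeholders):

```python
# Hypothetical shape for storing per-sample entries next to the averaged values.
results = {
    "generation_config": {
        "by_category": {
            "short": {
                "avg_inference_time": 0.21,   # (0.35 + 0.07) / 2, placeholder numbers
                "avg_generated_tokens": 20,   # (35 + 5) / 2
                "samples": [
                    {"inference_time": 0.35, "generated_tokens": 35},
                    {"inference_time": 0.07, "generated_tokens": 5},
                ],
            }
        }
    }
}
```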
Meta Info
I was comparing the result files of the peft_bench and MetaMathQA and I found the meta_info to be a lot richer. From the README:
"meta_info": {
  "model_sha": "13afe5124825b4f3751f836b40dafda64c1ed062",
  "model_created_at": "2024-09-18T15:23:48+00:00",
  "dataset_sha": "aa4f34d3d2d3231299b5b03d9b3e5a20da45aa18",
  "dataset_created_at": "2023-09-21T17:22:46+00:00",
  "package_info": {
    "transformers-version": "4.50.0.dev0",
    "transformers-commit-hash": "752ef3fd4e70869626ec70657a770a85c0ad9219",
    "peft-version": "0.14.1.dev0",
    "peft-commit-hash": "a447a4e5ecd87b7d57733f4df9616a328cf130f4",
    "datasets-version": "3.3.2",
    "datasets-commit-hash": null,
    "bitsandbytes-version": "0.45.2",
    "bitsandbytes-commit-hash": null,
    "torch-version": "2.6.0+cu124",
    "torch-commit-hash": null
  },
  "system_info": {
    "system": "Linux",
    "release": "6.11.0-17-generic",
    "version": "#17~24.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jan 20 22:48:29 UTC 2",
    "machine": "x86_64",
    "processor": "x86_64",
    "gpu": "NVIDIA GeForce RTX 4090"
  },
  "pytorch_info": "PyTorch built with: [...]"
}
I think it would make sense to include this in this benchmark as well. It would have the benefit of having all important system info in one place which makes it easier to pinpoint differences because of the environment. We could re-use the code from there. The info that currently resides in meta_info is more akin to run_info IMO. WDYT?
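A minimal sketch of how that meta info could be collected, using only the standard library and torch (MetaMathQA's actual helper may differ):

```python
import platform
from importlib.metadata import PackageNotFoundError, version

import torch


def collect_meta_info():
    """Gather package versions and system info, similar to MetaMathQA's meta_info."""
    packages = {}
    for name in ["transformers", "peft", "datasets", "bitsandbytes", "torch"]:
        try:
            packages[f"{name}-version"] = version(name)
        except PackageNotFoundError:
            packages[f"{name}-version"] = None
    return {
        "package_info": packages,
        "system_info": {
            "system": platform.system(),
            "release": platform.release(),
            "machine": platform.machine(),
            "processor": platform.processor(),
            "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
        },
        "pytorch_info": torch.__config__.show(),
    }
```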
Config inclusion
I think that both the PEFT and the benchmark config should be included in the results so that the result is a representation of the benchmark state (doesn't cost much and is way easier to handle). Something like results["run_info"]["peft_config"] = <peft adapter config> and results["run_info"]["benchmark_config"] = <benchmark config>.
> - **Trainable Parameters**: Number of parameters being fine-tuned
> - **Parameter Ratio**: Percentage of total parameters being trained
There's no training, what do you mean by training here?
I don't remember; removed.
> {
>   "model_id": "facebook/opt-350m",
I think it makes sense, model-wise, to use a more modern and popular model like Llama-3.2-3B instead of OPT.
| "num_inference_runs": 10, | ||
| "train_batch_size": 2, | ||
| "train_steps": 3, | ||
| "num_prompt_samples": 1 | 
I think we decided to use all prompts instead of sampling / using a subset, can we remove the num_prompt_samples parameter?
| "train_batch_size": 2, | ||
| "train_steps": 3, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| "train_batch_size": 2, | |
| "train_steps": 3, | 
Can be removed?
…, inputs out of loop
Thanks for the detailed review and keeping up with the PR <3
Thanks for the updates. I still discovered some issues, most notably I couldn't run the run_base.py script successfully, could you please check?
> about the first point NAME i am not sure, it could be anything like peft inference, fast peft, inference bench, as you guys wish

How about "Text Generation Benchmark" (directory: text_generation_benchmark)?

> added config for max new tokens depending on the prompt length

The differences are now noticeable.

About 3. and 4., okay, let's work on that later.
> base_model_name_or_path="facebook/opt-350m",
> bias="none",
> fan_in_fan_out=False,
> inference_mode=False,
> init_lora_weights=True,
> lora_alpha=16,
> lora_dropout=0.1,
> modules_to_save=None,
> peft_type="LORA",
Let's remove those arguments, they don't really add anything for the purpose of this benchmark.
| print_fn(f" {metric_name.replace('_', ' ').title()}: {value}") | ||
| overall = benchmark_result.generation_info.get("overall") | ||
| if overall: | ||
| # Use the same formatting as in Detailed Metrics | 
Remove the comment
        
          
In method_comparison/peft_bench/run.py (outdated):

> print_fn("Computing base model inference times...")
>
> # Prepare benchmark prompts
> # Prepare benchmark prompts
Remove
> """
> Script to measure base model inference times and save results for reuse.
> This should be run once per model configuration to avoid redundant computations.
Let's add this info also to the README
        
          
In method_comparison/peft_bench/run.py (outdated):

> print_fn("Using cached base model results...")
> base_inference_times = base_results["inference_results"]
> else:
> print_fn("No cached base results found. Please run run_base.py first.")
I think we can just raise an error here and require the base results to be computed first. Otherwise, there is too much risk of missing this message and then, when running dozens of script, we re-compute the base results dozens of times.
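For example, a sketch of failing fast instead of printing (the base_results.json file name is an assumption):

```python
import json
from pathlib import Path


def load_base_results(results_dir):
    """Fail fast instead of silently recomputing when the base results are missing."""
    base_results_path = Path(results_dir) / "base_results.json"  # hypothetical file name
    if not base_results_path.exists():
        raise FileNotFoundError(
            f"No cached base results found at {base_results_path}. Please run run_base.py first."
        )
    return json.loads(base_results_path.read_text())
```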
> import traceback
>
> traceback.print_exc()
I don't think we need this, just remove the whole try ... except to get the same result.
> def main():
>     """Main entry point for the base model benchmark runner."""
>     parser = argparse.ArgumentParser(description="Run base model benchmarks")
>     parser.add_argument("experiment_path", help="Path to experiment directory (to load config)")
I don't understand why we have to pass the experiment path here, should this not always use the default benchmark settings?
        
          
In method_comparison/peft_bench/run.py (outdated):

> inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
>
> # Get category-specific max_new_tokens
> cat_max_new_tokens = (category_generation_params.get(category, {}).get("max_new_tokens", max_new_tokens))
When category_generation_params is None, this line fails.
How could this be None?
Thanks for the updates, but I still have trouble creating the base benchmark results. I think if you follow my comments, the problem should be fixed. Ideally, before you push your final changes, you should test locally if the scripts pass successfully. Also, please run ruff on this directory.
> How could this be None?

In run_base.py, category_generation_params is not passed, so the default value of None is used. Try running python run_base.py and you should encounter the error. Let's remove the default value for category_generation_params and instead require the dict to always be passed.
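A sketch of the suggested signature change (a simplified stand-in for the PR's function, not its exact code):

```python
# Simplified stand-in for the generation loop; category_generation_params is now required.
def generate_outputs(model, tokenizer, prompts_by_category, category_generation_params):
    for category, prompts in prompts_by_category.items():
        cat_params = category_generation_params.get(category, {})
        cat_max_new_tokens = cat_params.get("max_new_tokens", 50)  # 50 is a placeholder fallback
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            model.generate(**inputs, max_new_tokens=cat_max_new_tokens)
```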
> **Usage:**
> ```bash
> python run_base.py experiments/lora/lora_r8
> ```
Calling the script like this is not possible, as the argument parser does not recognize the argument. Also, the base model results should not depend on any specific experiment. Instead, it should read the default_benchmark_params.json.
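A minimal sketch of what run_base.py's entry point could look like under this suggestion (argument handling and paths are assumptions based on this thread):

```python
import argparse
import json
from pathlib import Path


def main():
    parser = argparse.ArgumentParser(description="Run base model benchmarks")
    parser.add_argument("--verbose", action="store_true")
    args = parser.parse_args()

    # The base results depend only on the shared defaults, not on any specific experiment.
    defaults_path = Path(__file__).parent / "default_benchmark_params.json"
    benchmark_params = json.loads(defaults_path.read_text())
    ...


if __name__ == "__main__":
    main()
```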
Thanks for the review, covered all the changes. Will keep this in mind 🫡
Thanks for the updates, from my point of view, not much is missing. I had one more comment on the code, apart from that:
- Please create the required empty result directories (`results/` etc.) and place a `.gitkeep` inside, then add them.
- Just a suggestion: how about running prompts in reverse order, i.e. starting with the longest first? That way, the setting most likely to OOM runs first, wasting less of the user's time in case it happens. (A sketch follows below.)
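A sketch of that ordering (the grouping of prompts by category is assumed; names are illustrative):

```python
# Run the longest prompts first so a potential OOM shows up as early as possible.
def order_longest_first(prompts_by_category):
    flat = [
        (category, prompt)
        for category, prompts in prompts_by_category.items()
        for prompt in prompts
    ]
    return sorted(flat, key=lambda item: len(item[1]), reverse=True)
```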
 
> # Configure print function based on verbosity
> print_fn = print if args.verbose else lambda *args, **kwargs: None
>
> default_experiment_path = "experiments/lora/lora_r8"
Let's remove this argument. For the base model, I don't think it makes sense to call validate_experiment_path at all.
          
Can we do this in a separate PR?
          
Okay, that works. Could you please create the empty folders for
Thank you for adding the text generation benchmark to the PEFT method comparison suite and for your patience with the review. I think this is now in a good enough state that it can be merged.
Just so that we don't forget, there are still a few todos for this task, listed below:
- Add config files for all PEFT methods (can be mostly based on the existing ones for MetaMathQA, using the default settings).
- Running the prompt groups in reverse order, i.e. longest first, to hit potential OOMs as early as possible.
- Make the code device agnostic, like in #2610 (see the sketch below).
- Add a Makefile similar to what we have in MetaMathQA.
- Update the `method_comparison/app.py` to include the results from this benchmark.
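For the device-agnostic item, a generic hedged sketch of the usual pattern (not necessarily what #2610 does):

```python
import torch


def get_default_device():
    """Pick the best available accelerator instead of hard-coding CUDA."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```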
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Note: The task of making the benchmark more device agnostic has already been taken care of in #2730.
Scripts PR
Docs PR
This is the script format you asked for, similar to the MetaMathQA directory.
Please have a look and let me know the changes in detail; we have the numbers and all.
Will need to add more examples though, let me know what you think.