Conversation

@ved1beta (Contributor) commented Apr 30, 2025

scripts PR
docs PR

This is the script format you asked for, similar to the MetaMathQA directory.
Please have a look and let me know the changes in detail; we have the numbers and all.
Will need to add more examples though, let me know what you think.

@BenjaminBossan (Member):

Thanks a lot for the PR. I didn't have time to look into the details yet, hopefully will do that on Friday. In the meantime, could you please delete all the results (benchmark_result.json)? We will run the experiments on our hardware to get reproducible results.

@BenjaminBossan (Member) left a comment

Thanks a lot for your hard work on this topic. I think this PR brings us a lot closer to a feature with similar usability as the MetaMathQA training suite. There are still some bigger and smaller areas for improvement, but I'm sure we're gonna get there.

Regarding the result log, it is currently just a flat json. I would like to see more structure there. In MetaMathQA, we have a json with 3 keys: run_info, train_info, and meta_info. Let's try to structure the results here similarly. Especially the meta info is currently completely absent. Let's add something similar to what we have in MetaMathQA.
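To illustrate, a rough sketch of the top-level layout I have in mind (keys borrowed from MetaMathQA; the exact contents are of course up for discussion):

```python
# Illustrative sketch only, not a final schema: mirror the MetaMathQA layout
# with three top-level keys instead of a flat json.
result = {
    "run_info": {},    # e.g. experiment name, timestamps, duration, status
    "train_info": {},  # for this suite rather the benchmark/generation metrics
    "meta_info": {},   # package versions, system info, model SHA, ...
}
```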

We're working with untrained adapters. This should generally be fine, as most adapters like LoRA don't influence the output when not trained, so the model generations should be identical to those of the base model. There are some PEFT methods that cannot be zero-initialized, however, which means that for these methods the generations will look different. I think we can mitigate this by tracking the generation time per token, so that longer generations are not automatically penalized.
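To make the per-token idea concrete, here is a minimal sketch (names are made up, not taken from the actual script):

```python
import time

import torch


@torch.no_grad()
def timed_generate(model, inputs, **generate_kwargs):
    # Sketch: normalize the wall-clock generation time by the number of newly
    # generated tokens, so methods that happen to generate longer outputs are
    # not automatically penalized.
    start = time.perf_counter()
    output = model.generate(**inputs, **generate_kwargs)
    elapsed = time.perf_counter() - start
    num_new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    return output, elapsed / max(num_new_tokens, 1)
```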

One thing that would be good to improve is the parametrization of the PEFT configs. I don't have a good proposal how, but let me explain what I mean: Right now, there is a LoRA config with rank 8 and another one with rank 16. The rest is identical. If we want to add more ranks, each time, we need to create a copy. And what if we want to parametrize another setting? The number of configs would increase polynomially. Ideally, we would only have a single LoRA config with the fixed parameters and then another way to define the changing parameters. Do you have some ideas?
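For illustration only (this is just one possible direction, not a concrete proposal): keep a single set of fixed base parameters plus a small list of overrides, and build the concrete configs from that.

```python
# Hypothetical sketch: one fixed set of base LoRA arguments plus a list of
# overrides, instead of one full config copy per rank.
from peft import LoraConfig

base_kwargs = dict(lora_alpha=16, lora_dropout=0.1, target_modules=["q_proj", "v_proj"])
overrides = [{"r": 8}, {"r": 16}, {"r": 32}]

configs = [LoraConfig(**base_kwargs, **override) for override in overrides]
```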

Also, please add a README.md and run make style before pushing your changes.

@ved1beta (Contributor Author) commented May 5, 2025

Tried to cover all the changes, please have a look :)

@BenjaminBossan (Member) left a comment

Thanks a lot for the updates, we're moving in the right direction. Unfortunately, due to some issues that I commented on, I could not run the script successfully. Could you please check and update the PR? Also, some of my previous comments are still unaddressed.

@ved1beta (Contributor Author) commented May 5, 2025

I am not sure why it's not running on your end; the rest we can mostly fix. Please can you guide me with that?

@ved1beta force-pushed the benchmark2scripts branch from ea4bb46 to 627a038 on May 5, 2025 17:15
@ved1beta (Contributor Author) commented May 5, 2025

There are some import issues so we can't run it directly: cd /home/ved/code/git/peft && DISABLE_FLASH_ATTN=1 PYTHONPATH=. python3 method_comparison/peft_bench/run.py method_comparison/peft_bench/experiments/lora/lora_r16 --verbose
We need to specify the path and all to make it work. I am working on it; let me know if you have any idea how to fix it.

@githubnemo (Collaborator):

> There are some import issues so we can't run it directly: cd /home/ved/code/git/peft && DISABLE_FLASH_ATTN=1 PYTHONPATH=. python3 method_comparison/peft_bench/run.py method_comparison/peft_bench/experiments/lora/lora_r16 --verbose. We need to specify the path and all to make it work. I am working on it; let me know if you have any idea how to fix it.

Can you be a bit more specific on what import errors you're experiencing?

@ved1beta (Contributor Author) commented May 13, 2025

@githubnemo
The import errors occur because the script imports its sibling modules directly (from data import ... and from utils import ...), which only works when that directory is on the Python path. We can either:

  1. Use absolute imports:

     from method_comparison.peft_bench.data import prepare_benchmark_prompts
     from method_comparison.peft_bench.utils import ...

  2. Or add the directory to the Python path in the script:

     import os
     import sys
     sys.path.append(os.path.dirname(os.path.abspath(__file__)))

For now the only way to run it is: cd /home/ved/code/git/peft && DISABLE_FLASH_ATTN=1 PYTHONPATH=. python3 method_comparison/peft_bench/run.py method_comparison/peft_bench/experiments/lora/lora_r16 --verbose

Hey @BenjaminBossan, please can you have a look and give me the final bunch of changes? Willing to finish this ASAP :)

@githubnemo (Collaborator) left a comment

Thanks for pushing this.
I'm taking over the review since @BenjaminBossan is currently OoO.

Regarding the import path issues (and the experiment path resolving) I'd suggest limiting the execution of experiments to the method_comparison/peft_bench folder - every other folder is unsupported, making the code a bit simpler, I think.

I think there are still some comments from @BenjaminBossan that are unresolved regarding the number of iterations in the LoRA experiment configs and regarding the default config behavior - I've added some comments of my own on top.

@ved1beta (Contributor Author) commented May 24, 2025

Did all the required changes from above (from you); we can resolve all conversations. Please go through it and let me know.

Co-authored-by: Benjamin Bossan <[email protected]>
@ved1beta (Contributor Author) commented Jun 6, 2025

I know I have not covered all of the reviews due to their large number, and now they are all mixed up solved/unsolved.
I would still love to complete this, just let me know what is expected.
Thanks

@githubnemo (Collaborator) left a comment

Thank you so much! You covered a lot already but I get it, there are a lot of comments. I went over all comments and marked quite a few old ones from previous reviews as resolved - those that are still open should have a comment stating why they are still open or are discussed in this comment.

Here's an attempt to list the major things that are still open (the unresolved comments are still valid, as far as I can see, but let's focus on the big things first):

prompt categories: As Benjamin already commented, it is probably best to run the benchmark across all prompt categories but track the metrics separately so that the results are always comparable (i.e. it is the same task for each experiment but the model parameters differ). See this comment. This ties in to how we will manage configs, I think.

config arrangement: Using a default config for each experiment and just configuring the difference is still unresolved, at least partially. See this comment for the general remark. Let me make a suggestion and let's discuss the best way of handling config. MetaMathQA has

  • (1) an adapter config and
  • (2) optionally training params

The equivalent for the PEFT bench task would be the (1) adapter config and (2) the benchmark config. The adapter config should be different for each experiment. The benchmark config should only differ if there's something that really needs changing, like the number of inference runs because the measurement error is still too high. But ideally only the adapter config changes between experiments so that every experiment outcome is comparable. This means that there will be one default config for each experiment (regarding the benchmark setup) and one adapter config for each experiment. This would also remove the need for the "peft_config_variant" field and there would be a clear separation between benchmark setup and adapter config. WDYT?
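For illustration, a rough sketch of what loading an experiment could look like under this split (the file names here are assumptions on my part):

```python
import json
from pathlib import Path


def load_experiment(experiment_path):
    # Sketch: one shared benchmark config for all experiments, one adapter
    # config per experiment directory.
    benchmark_config = json.loads(Path("default_benchmark_params.json").read_text())
    adapter_config = json.loads(Path(experiment_path, "adapter_config.json").read_text())
    return benchmark_config, adapter_config
```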

@ved1beta (Contributor Author) commented Jun 6, 2025

I have covered all the changes mentioned in the recent replies. Also, about the configs, we can take your approach; just verify these changes and we can work on the other part.

@githubnemo (Collaborator) left a comment

Yes, this looks good, thank you!

Individual results

Regarding the results, I was wondering about storing the individual results as well, alongside the averaged values. For example results["generation_config"]["by_category"]["short"]["samples"][0] = {"inference_time": .., "generated_tokens": ..., etc.}. One scenario I have in mind where this might be beneficial: you can arrive at an average of 20 generated tokens in a lot of ways, i.e. (35 + 5)/2 = (20 + 20)/2 = (30 + 10)/2 = ... You could then falsely claim that method X is slower on average, when in reality there may have been a bug which generated one very long sequence and very short ones after (and this is important since runtime is not necessarily linear, especially for transformers). If we store the individual results as well, we could easily spot that the sequence length distribution is off.

Meta Info

I was comparing the result files of the peft_bench and MetaMathQA and I found the MetaMathQA meta_info to be a lot richer. From the README:

"meta_info": {
  "model_sha": "13afe5124825b4f3751f836b40dafda64c1ed062",
  "model_created_at": "2024-09-18T15:23:48+00:00",
  "dataset_sha": "aa4f34d3d2d3231299b5b03d9b3e5a20da45aa18",
  "dataset_created_at": "2023-09-21T17:22:46+00:00",
  "package_info": {
    "transformers-version": "4.50.0.dev0",
    "transformers-commit-hash": "752ef3fd4e70869626ec70657a770a85c0ad9219",
    "peft-version": "0.14.1.dev0",
    "peft-commit-hash": "a447a4e5ecd87b7d57733f4df9616a328cf130f4",
    "datasets-version": "3.3.2",
    "datasets-commit-hash": null,
    "bitsandbytes-version": "0.45.2",
    "bitsandbytes-commit-hash": null,
    "torch-version": "2.6.0+cu124",
    "torch-commit-hash": null
  },
  "system_info": {
    "system": "Linux",
    "release": "6.11.0-17-generic",
    "version": "#17~24.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jan 20 22:48:29 UTC 2",
    "machine": "x86_64",
    "processor": "x86_64",
    "gpu": "NVIDIA GeForce RTX 4090"
  },
  "pytorch_info": "PyTorch built with: [...]"
}

I think it would make sense to include this in this benchmark as well. It would have the benefit of having all important system info in one place which makes it easier to pinpoint differences because of the environment. We could re-use the code from there. The info that currently resides in meta_info is more akin to run_info IMO. WDYT?
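Purely as a sketch of what collecting this could look like (ideally we just re-use the existing MetaMathQA helpers instead of re-implementing it):

```python
import platform
from importlib.metadata import PackageNotFoundError, version

import torch


def collect_meta_info():
    # Sketch: gather package versions and basic system info, similar to the
    # MetaMathQA meta_info block.
    package_info = {}
    for pkg in ["transformers", "peft", "datasets", "bitsandbytes", "torch"]:
        try:
            package_info[f"{pkg}-version"] = version(pkg)
        except PackageNotFoundError:
            package_info[f"{pkg}-version"] = None
    system_info = {
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    }
    return {"package_info": package_info, "system_info": system_info}
```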

Config inclusion

I think that both the PEFT and the benchmark config should be included in the results so that the result is a representation of the benchmark state (doesn't cost much and is way easier to handle). Something like results["run_info"]["peft_config"] = <peft adapter config> and results["run_info"]["benchmark_config"] = <benchmark config>.

Comment on lines 165 to 166
- **Trainable Parameters**: Number of parameters being fine-tuned
- **Parameter Ratio**: Percentage of total parameters being trained
Collaborator:

There's no training, what do you mean by training here?

@ved1beta (Contributor Author) commented Jun 26, 2025

I don't remember; removed.

@@ -0,0 +1,11 @@
{
"model_id": "facebook/opt-350m",
Collaborator:

I think it makes sense, model-wise, to use a more modern and popular model like Llama-3.2-3B instead of OPT.

"num_inference_runs": 10,
"train_batch_size": 2,
"train_steps": 3,
"num_prompt_samples": 1
Collaborator:

I think we decided to use all prompts instead of sampling / using a subset; can we remove the num_prompt_samples parameter?

Comment on lines 8 to 9
"train_batch_size": 2,
"train_steps": 3,
Collaborator:

Suggested change
"train_batch_size": 2,
"train_steps": 3,

Can be removed?

@ved1beta (Contributor Author) commented Jul 26, 2025

Thanks for the detailed review and keeping up with the PR <3
Covered all the comments except one.
Please close the comments you think are resolved.

  1. About the first point, NAME: I am not sure; it could be anything like peft inference, fast peft, or inference bench, as you guys wish.
  2. Added config for max new tokens depending on the prompt length.
  3. Let's cover both of these in a separate PR.
  4. Done.

@BenjaminBossan (Member) left a comment

Thanks for the updates. I still discovered some issues, most notably I couldn't run the run_base.py script successfully, could you please check?

> About the first point, NAME: I am not sure; it could be anything like peft inference, fast peft, or inference bench, as you guys wish.

How about "Text Generation Benchmark" (directory: text_generation_benchmark)?

> Added config for max new tokens depending on the prompt length.

The differences are now noticeable.

About 3. and 4., okay, let's work on that later.

Comment on lines 104 to 112
base_model_name_or_path="facebook/opt-350m",
bias="none",
fan_in_fan_out=False,
inference_mode=False,
init_lora_weights=True,
lora_alpha=16,
lora_dropout=0.1,
modules_to_save=None,
peft_type="LORA",
Member:

Let's remove those arguments, they don't really add anything for the purpose of this benchmark.

print_fn(f" {metric_name.replace('_', ' ').title()}: {value}")
overall = benchmark_result.generation_info.get("overall")
if overall:
# Use the same formatting as in Detailed Metrics
Member:

Remove the comment

print_fn("Computing base model inference times...")

# Prepare benchmark prompts
# Prepare benchmark prompts
Member:

Remove


"""
Script to measure base model inference times and save results for reuse.
This should be run once per model configuration to avoid redundant computations.
Member:

Let's add this info also to the README

print_fn("Using cached base model results...")
base_inference_times = base_results["inference_results"]
else:
print_fn("No cached base results found. Please run run_base.py first.")
Member:

I think we can just raise an error here and require the base results to be computed first. Otherwise, there is too much risk of missing this message and then, when running dozens of scripts, we re-compute the base results dozens of times.
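Something along these lines (just a sketch, the variable name is made up):

```python
if not base_results_path.exists():
    # Fail fast so the base results are never silently recomputed across
    # dozens of experiment runs.
    raise FileNotFoundError(
        f"No cached base model results found at {base_results_path}, please run run_base.py first."
    )
```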

Comment on lines 210 to 212
import traceback

traceback.print_exc()
Member:

I don't think we need this, just remove the whole try ... except to get the same result.

def main():
"""Main entry point for the base model benchmark runner."""
parser = argparse.ArgumentParser(description="Run base model benchmarks")
parser.add_argument("experiment_path", help="Path to experiment directory (to load config)")
Member:

I don't understand why we have to pass the experiment path here, should this not always use the default benchmark settings?

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Get category-specific max_new_tokens
cat_max_new_tokens = (category_generation_params.get(category, {}).get("max_new_tokens", max_new_tokens))
Member:

When category_generation_params is None, this line fails.

@ved1beta (Contributor Author):

How could this be None?

@ved1beta (Contributor Author):

text_generation_benchmark works. Covered all the changes except the last one; we can discuss that above.

@BenjaminBossan (Member) left a comment

Thanks for the updates, but I still have trouble creating the base benchmark results. I think if you follow my comments, the problem should be fixed. Ideally, before you push your final changes, you should test locally if the scripts pass successfully. Also, please run ruff on this directory.

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Get category-specific max_new_tokens
cat_max_new_tokens = (category_generation_params.get(category, {}).get("max_new_tokens", max_new_tokens))
Member:

> How could this be None?

In run_base.py, category_generation_params are not passed, so the default value of None is used. Try running python run_base.py and you should encounter the error. Let's remove the default value for category_generation_params and instead require the dict to always be passed.
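Roughly what I have in mind (sketch only, the real signature may look different):

```python
def generate_for_category(model, tokenizer, prompt, category, category_generation_params, max_new_tokens):
    # category_generation_params is now a required argument, so the lookup
    # below can no longer fail due to a None default.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    cat_max_new_tokens = category_generation_params.get(category, {}).get("max_new_tokens", max_new_tokens)
    return model.generate(**inputs, max_new_tokens=cat_max_new_tokens)
```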


**Usage:**
```bash
python run_base.py experiments/lora/lora_r8
```
Member:

Calling the script like this is not possible, as the argument parser does not recognize the argument. Also, the base model results should not depend on any specific experiment. Instead, it should read the default_benchmark_params.json.

@ved1beta (Contributor Author):

Thanks for the review, covered all the changes.

> Ideally, before you push your final changes, you should test locally if the scripts pass successfully. Also, please run ruff on this directory.

Will keep this in mind 🫡

@BenjaminBossan (Member) left a comment

Thanks for the updates, from my point of view, not much is missing. I had one more comment on the code, apart from that:

  • Please create the required empty result directories (results/ etc.) and place a .gitkeep inside, then add them.
  • Just a suggestion: How about running prompts in reverse order, i.e. starting with the longest first? That way, the setting most likely to OOM runs first, wasting less of the user's time in case it happens.
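A minimal sketch of what I mean (assuming the prompts are plain strings; the actual data structure may differ):

```python
def longest_first(prompts, tokenizer):
    # Run the longest prompts first so a potential OOM shows up as early as
    # possible instead of at the end of a long run.
    return sorted(prompts, key=lambda p: len(tokenizer(p)["input_ids"]), reverse=True)
```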

# Configure print function based on verbosity
print_fn = print if args.verbose else lambda *args, **kwargs: None

default_experiment_path = "experiments/lora/lora_r8"
Member:

Let's remove this argument. For the base model, I don't think it makes sense to call validate_experiment_path at all.

@ved1beta (Contributor Author) commented Aug 1, 2025

> Just a suggestion: How about running prompts in reverse order, i.e. starting with the longest first? That way, the setting most likely to OOM runs first, wasting less of the user's time in case it happens.

Can we do this in a separate PR?

@BenjaminBossan (Member):

> Can we do this in a separate PR?

Okay, that works. Could you please create the empty folders for RESULT_PATH_TEMP and RESULT_PATH_CANCELLED too?

@BenjaminBossan (Member) left a comment

Thank you for adding the text generation benchmark to the PEFT method comparison suite and for your patience with the review. I think this is now in a good enough state that it can be merged.

Just so that we don't forget, there are still a few todos for this task, listed below:

  • Add config files for all PEFT methods (can be mostly based on the existing ones for MetaMathQA, using the default settings).
  • Running the prompt groups in reverse order, i.e. longest first, to hit potential OOMs as early as possible.
  • Make the code device agnostic, like in #2610.
  • Add a Makefile similar to what we have in MetaMathQA.
  • Update the method_comparison/app.py to include the results from this benchmark.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan merged commit ec5a1c6 into huggingface:main on Aug 7, 2025 (2 of 14 checks passed)
@BenjaminBossan (Member):

Note: The task of making the benchmark more device agnostic has already been taken care of in #2730.
