Merged
45 commits
4d03d65
bench mark scripts
ved1beta Apr 30, 2025
18733f5
json removed
ved1beta Apr 30, 2025
b31236e
required changes
ved1beta May 5, 2025
e65b2be
read me commad fix '
ved1beta May 5, 2025
627a038
required changes
ved1beta May 5, 2025
2de1796
configs
ved1beta May 5, 2025
03dc404
readme
ved1beta May 5, 2025
ec00ec6
import handler
ved1beta May 6, 2025
9b476f2
required chnages
ved1beta May 24, 2025
92e1d64
base model name/path
ved1beta Jun 6, 2025
4b67c62
defualt json
ved1beta Jun 6, 2025
338d9ad
ruff
ved1beta Jun 6, 2025
75fdd77
format
ved1beta Jun 6, 2025
04069d8
updated read me
ved1beta Jun 6, 2025
bc6e729
Merge branch 'benchmark2scripts' of github.com:ved1beta/peft into ben…
ved1beta Jun 6, 2025
8c0f9cb
requested read me changes
ved1beta Jun 12, 2025
c2af755
[200~python3 run.py experiments/lora/lora_r8 --verbose
ved1beta Jun 12, 2025
e165516
individual results storage + requested chnages
ved1beta Jun 12, 2025
6054617
train parameters removed
ved1beta Jun 12, 2025
a441851
removed sample_config
ved1beta Jun 12, 2025
ffa2033
num inference and config removed
ved1beta Jun 12, 2025
58eacb9
model name change - removed selectPrompts
ved1beta Jun 12, 2025
2d34bc5
removed imports n related
ved1beta Jun 26, 2025
68f9496
timestams from file name removed
ved1beta Jun 26, 2025
2710e62
undo commit
ved1beta Jun 26, 2025
86d5215
overall section + required changes
ved1beta Jul 1, 2025
c6e2fbe
undo change info + peft_config.dict not none
ved1beta Jul 5, 2025
b792d3a
feat naming func according to peft method + minNewTokens=maxNewTokens
ved1beta Jul 8, 2025
ff7114d
to dict conversion removed
ved1beta Jul 26, 2025
12edab9
requested changes include:remove traceback,optional added,dtype error…
ved1beta Jul 26, 2025
b3858fc
run_base integreation
ved1beta Jul 26, 2025
7002a81
required read me changes
ved1beta Jul 26, 2025
2611c67
Overall Metrics format
ved1beta Jul 26, 2025
d58658d
changed branch logic to match MetaMathQA
ved1beta Jul 26, 2025
ecbcdba
removed
ved1beta Jul 26, 2025
4780ae5
ruff
ved1beta Jul 26, 2025
7b54782
max token vary 20-50-100
ved1beta Jul 26, 2025
ecc1382
requested chanegs
ved1beta Jul 29, 2025
b10f873
text_generation_benchmark
ved1beta Jul 29, 2025
7581b1d
Update method_comparison/text_generation_benchmark/run.py
ved1beta Jul 31, 2025
f3f992d
category_generation_params added to run_base + additional
ved1beta Jul 31, 2025
02c0872
style comments
ved1beta Aug 1, 2025
a3be879
.git keep added
ved1beta Aug 1, 2025
c309e02
ruff
ved1beta Aug 1, 2025
2f9a537
added cancelled_results/ temporary_results/
ved1beta Aug 5, 2025
179 changes: 179 additions & 0 deletions method_comparison/text_generation_benchmark/README.md
@@ -0,0 +1,179 @@
# PEFT Benchmarking Suite

This directory contains a comprehensive benchmarking framework for Parameter-Efficient Fine-Tuning (PEFT) methods. For the task of text generation, the suite measures inference performance, memory usage, and other key metrics across different PEFT configurations.

## Overview

The benchmarking suite provides:
- **Inference time measurement** across different prompt categories
- **Memory usage during inference** (RAM and GPU)
- **Parameter efficiency metrics** (trainable vs. total parameters)
- **Time per token analysis** for fair comparison across different generation lengths
- **Structured result logging** with detailed metadata

## Base Model Inference Caching

The benchmarking suite uses a separate script, `run_base.py`, to measure base model inference times and save the results for reuse. Run it once per model configuration to avoid redundant computation and to ensure consistent baseline metrics for all PEFT experiments.

**Usage:**
```bash
python run_base.py
```

This caches the base model inference results for the specified configuration. Subsequent runs of `run.py` automatically load these cached results.

## Architecture

The suite follows a clean separation between:
1. **Default benchmark configuration** - shared settings for consistent comparison
2. **Individual adapter configurations** - PEFT-specific parameters for each experiment

This ensures that all experiments are comparable while allowing flexibility in adapter parameters.

## Quick Start

### Running a Single Experiment

```bash
# From the method_comparison/text_generation_benchmark directory
python run.py experiments/lora/lora_r8 --verbose
```

## Configuration Structure

The benchmarking suite uses a hierarchical configuration system:

1. **Default benchmark parameters** (`default_benchmark_params.json`) - Base configuration shared by all experiments
2. **Experiment-specific overrides** (`benchmark_params.json` in each experiment) - Optional overrides for specific experiments
3. **Adapter configuration** (`adapter_config.json` in each experiment) - PEFT method parameters

This structure ensures consistent comparison while allowing flexibility where needed.

### Default Configuration (`default_benchmark_params.json`)

Contains shared benchmark settings that apply to all experiments. Here are the key configuration fields:

- `model_id`: The Hugging Face model ID to use as the base model (e.g., "facebook/opt-350m")
- `dtype`: Model precision ("float16", "float32", or "bfloat16")
- `seed`: Random seed for reproducibility
- `max_new_tokens`: Maximum number of tokens to generate during inference
- `num_inference_runs`: Number of inference runs per prompt for statistical reliability
- `use_4bit`: Whether to use 4-bit quantization (bool)
- `use_8bit`: Whether to use 8-bit quantization (bool)

Each experiment can override these settings by providing its own `benchmark_params.json` file.

### Experiment Structure

Each experiment directory should contain:

1. `adapter_config.json`: PEFT adapter configuration. For details on available parameters and their meanings, refer to the [PEFT documentation](https://huggingface.co/docs/peft/main/en/developer_guides/adapters).

2. (Optional) `benchmark_params.json`: Override specific benchmark parameters for this experiment.

Example directory structure:
```
experiments/
└── lora/
    ├── lora_r8/                    # LoRA rank 8 experiment
    │   ├── adapter_config.json     # PEFT adapter configuration
    │   └── benchmark_params.json   # Optional benchmark overrides
    └── lora_r16/                   # LoRA rank 16 experiment
        └── adapter_config.json
```

### Experiment-Specific Overrides Example

If an experiment needs different benchmark settings, create `benchmark_params.json`:
```json
{
  "_comment": "Override settings for this specific experiment",
  "max_new_tokens": 50,
  "num_inference_runs": 15,
  "num_prompt_samples": 2
}
```

These parameters override the defaults from `default_benchmark_params.json`. The defaults themselves should generally be left unchanged so that results from individual experiments remain comparable.
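
The merge is a simple two-level lookup in which experiment values win over the shared defaults. The helper below is a minimal sketch of such a resolution step, not the exact logic used by `run.py`; the function name is illustrative.

```python
import json
import os


def resolve_benchmark_params(experiment_dir: str, defaults_path: str = "default_benchmark_params.json") -> dict:
    """Load the shared defaults, then apply any experiment-specific overrides (illustrative only)."""
    with open(defaults_path) as f:
        params = json.load(f)

    override_path = os.path.join(experiment_dir, "benchmark_params.json")
    if os.path.exists(override_path):
        with open(override_path) as f:
            params.update(json.load(f))  # experiment values take precedence over the defaults

    return params


params = resolve_benchmark_params("experiments/lora/lora_r8")
print(params["max_new_tokens"])  # 50 with the override above, otherwise the default
```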

### Create a New Experiment Adapter Configuration

To create a new experiment, follow these steps:

1. **Create the experiment directory**
```bash
mkdir -p experiments/lora/lora_r8
```

2. **Generate the adapter configuration programmatically**
Use the PEFT library to create and save your adapter config:

```python
from peft import LoraConfig

config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=8,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"
)
config.save_pretrained("experiments/lora/lora_r8")
```

This will create an `adapter_config.json` in your experiment directory. Adjust parameters as needed for your experiment.

3. **(Optional) Add benchmark overrides**
If you need to override default benchmark settings, create a `benchmark_params.json` in the same directory.

4. **Run the benchmark**
```bash
python run.py experiments/lora/lora_r8 --verbose
```

## Prompt Categories

The benchmark automatically runs across all prompt categories for consistent comparison:
- **short** - Brief prompts (1-2 sentences)
- **medium** - Moderate length prompts (paragraph-level)
- **long** - Extended prompts (multiple paragraphs)

Results are tracked separately for each category, allowing analysis of how different PEFT methods perform across varying input lengths.
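
As a rough sketch of how per-category generation can work, the snippet below loops over categories with a per-category `max_new_tokens` (the values mirror `category_generation_params` in the default config, and the prompts are taken from `configs/prompts.json`). The variable names, the small `facebook/opt-350m` model, and the timing loop are illustrative assumptions, not the benchmark's actual implementation.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative subset of configs/prompts.json and category_generation_params;
# the real benchmark loads both from disk.
prompts = {
    "short": ["Write a haiku about machine learning."],
    "medium": ["Compare and contrast prompt tuning and prefix tuning approaches for adapting large language models."],
}
category_max_new_tokens = {"short": 20, "medium": 50, "long": 100}

model_id = "facebook/opt-350m"  # small model for illustration; the benchmark's model_id is configurable
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

for category, category_prompts in prompts.items():
    max_new_tokens = category_max_new_tokens[category]
    for prompt in category_prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        start = time.perf_counter()
        with torch.no_grad():
            # Pinning min_new_tokens to max_new_tokens keeps the generation length fixed,
            # so timings are comparable across runs.
            outputs = model.generate(**inputs, min_new_tokens=max_new_tokens, max_new_tokens=max_new_tokens)
        elapsed = time.perf_counter() - start
        generated = outputs.shape[1] - inputs["input_ids"].shape[1]
        print(f"{category}: {elapsed:.2f}s total, {elapsed / generated * 1000:.1f} ms/token")
```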

## Results Structure

Results are saved in a structured JSON format with three main sections:

### `run_info`
- Execution metadata (timestamp, duration, status)
- Hardware information (GPU type, CUDA version, etc.)
- Error information (if applicable)
- PEFT and benchmark configurations

### `generation_info`
- Memory usage logs at different stages
- Per-category metrics (inference time, time per token, etc.)
- Overall aggregated metrics
- Individual sample results for detailed analysis

### `meta_info`
- Model information (ID, PEFT method)
- Parameter counts (adapter, total, ratio)
- Model size information (base model, adapter)
- System and package information
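
The exact keys inside each section are produced by the benchmark, but the three documented top-level sections can be inspected directly. A minimal sketch, assuming a placeholder result path:

```python
import json

# Hypothetical path; actual file names follow the experiment / PEFT method name.
with open("results/lora_r8.json") as f:
    result = json.load(f)

run_info = result["run_info"]              # execution metadata, hardware info, configs
generation_info = result["generation_info"]  # memory logs, per-category and overall metrics
meta_info = result["meta_info"]            # model info, parameter counts, sizes

print(list(run_info), list(generation_info), list(meta_info))
```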

## Key Metrics

### Inference Performance
- **Inference Time**: Total time for generation per category
- **Time Per Token**: Normalized time accounting for different generation lengths
- **Inference Overhead**: Percentage increase compared to base model
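
For reference, the two derived metrics above can be computed from raw timings as follows. This is a sketch of the arithmetic only; the benchmark computes these values internally.

```python
def time_per_token(inference_time_s: float, generated_tokens: int) -> float:
    """Normalize total generation time by the number of generated tokens."""
    return inference_time_s / generated_tokens


def inference_overhead_pct(peft_time_s: float, base_time_s: float) -> float:
    """Percentage increase of PEFT inference time over the cached base-model time."""
    return (peft_time_s - base_time_s) / base_time_s * 100


print(time_per_token(1.2, 50))           # 0.024 s per token
print(inference_overhead_pct(1.2, 1.0))  # 20.0 (% slower than the base model)
```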

### Memory Usage
- **Peak GPU Memory**: Maximum GPU memory during benchmark
- **Peak RAM Memory**: Maximum RAM usage
- **Memory Logs**: Detailed tracking at each stage

### Parameter Efficiency
- **Adapter Parameters**: Number of parameters in the PEFT adapter
- **Parameter Ratio**: Percentage of total model parameters that are in the adapter
- **Adapter Size**: Memory footprint of the adapter in MB
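
These numbers can also be reproduced outside the benchmark with standard PEFT APIs. A minimal sketch using the LoRA configuration from the example above and a small base model (the choice of `facebook/opt-350m` is illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Small base model for illustration; the benchmark's model_id is set in the config.
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
peft_model = get_peft_model(base_model, config)

# Trainable (adapter) parameters vs. all parameters, and their ratio.
trainable, total = peft_model.get_nb_trainable_parameters()
print(f"adapter parameters: {trainable:,}")
print(f"parameter ratio: {100 * trainable / total:.4f}%")
```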
23 changes: 23 additions & 0 deletions method_comparison/text_generation_benchmark/configs/prompts.json
@@ -0,0 +1,23 @@
{
  "short": [
    "Explain quantum computing in one paragraph.",
    "Write a haiku about machine learning.",
    "What's the difference between supervised and unsupervised learning?",
    "Define parameter-efficient fine-tuning in one sentence.",
    "List three applications of natural language processing."
  ],
  "medium": [
    "Explain the concept of low-rank adaptation (LoRA) for large language models. Include its benefits and limitations.",
    "Compare and contrast prompt tuning and prefix tuning approaches for adapting large language models.",
    "What are the key differences between full fine-tuning and parameter-efficient methods? Explain with examples.",
    "Describe the process of quantization for neural networks and how it affects model size and inference speed.",
    "Explain how sparse expert models like Mixture of Experts work and their advantages over dense models."
  ],
  "long": [
    "Analyze the evolution of parameter-efficient fine-tuning methods from 2020 to present. Include a detailed comparison of at least five different approaches, their theoretical foundations, and practical implications for deploying large language models.",
    "Provide a comprehensive tutorial on implementing LoRA for a transformer-based language model. Include code examples, hyperparameter selection guidance, and best practices for training and deployment.",
    "Compare the computational efficiency, parameter count, and performance characteristics of different PEFT methods (LoRA, Prefix Tuning, Prompt Tuning, IA3, AdaLoRA) across various downstream tasks. Include a discussion of when each method is most appropriate.",
    "Explain the mathematical foundations of various parameter-efficient fine-tuning techniques. Discuss how each technique modifies the original neural network architecture and the optimization challenges involved.",
    "Discuss the ethical implications of parameter-efficient fine-tuning methods in democratizing access to large language models. Include considerations about computational resources, environmental impact, and accessibility for researchers in resource-constrained settings."
  ]
}
119 changes: 119 additions & 0 deletions method_comparison/text_generation_benchmark/data.py
@@ -0,0 +1,119 @@
# Copyright 2025-present the HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Data handling utilities for PEFT benchmarking.
"""

import json
import os
from typing import Optional

from transformers import PreTrainedTokenizer
from utils import BenchmarkConfig


DEFAULT_PROMPTS_PATH = os.path.join(os.path.dirname(__file__), "configs", "prompts.json")


def load_test_prompts(config: BenchmarkConfig) -> dict[str, list[str]]:
    """
    Load prompts from JSON file.

    Args:
        config: Benchmark configuration containing the prompts file path

    Returns:
        Dictionary with prompts by category
    """
    prompts_file = getattr(config, "prompts_file", DEFAULT_PROMPTS_PATH)

    with open(prompts_file) as f:
        prompts = json.load(f)

    return prompts


def truncate_prompt_for_model(
    prompt: str,
    tokenizer: PreTrainedTokenizer,
    max_length: Optional[int] = None,
    reserve_output_tokens: int = 50,
) -> str:
    """
    Truncate a prompt to fit within the model's context window.

    Args:
        prompt: Input prompt
        tokenizer: Model tokenizer
        max_length: Maximum sequence length (if None, uses model's max_length)
        reserve_output_tokens: Number of tokens to reserve for the response

    Returns:
        Truncated prompt
    """
    if max_length is None:
        if hasattr(tokenizer, "model_max_length"):
            max_length = tokenizer.model_max_length
        else:
            max_length = 2048

    max_prompt_length = max_length - reserve_output_tokens
    input_ids = tokenizer.encode(prompt, return_tensors="pt")[0]

    if len(input_ids) <= max_prompt_length:
        return prompt

    truncated_ids = input_ids[:max_prompt_length]
    truncated_prompt = tokenizer.decode(truncated_ids, skip_special_tokens=True)

    return truncated_prompt


def prepare_benchmark_prompts(
    config: BenchmarkConfig,
    tokenizer: PreTrainedTokenizer,
    max_input_length: Optional[int] = None,
    seed: int = 42,
) -> dict[str, list[str]]:
    """
    Prepare prompts for benchmarking, ensuring appropriate length and variety.
    Always returns all prompt categories for consistent benchmarking.

    Args:
        config: Benchmark configuration
        tokenizer: Model tokenizer
        max_input_length: Maximum input length (overrides model default if provided)
        seed: Random seed (kept for backwards compatibility)

    Returns:
        Dictionary with processed prompts by category (all categories included)
    """
    all_prompts = load_test_prompts(config)

    processed_prompts = {}
    for category, prompts in all_prompts.items():
        truncated_prompts = [
            truncate_prompt_for_model(
                prompt,
                tokenizer,
                max_length=max_input_length,
                reserve_output_tokens=getattr(config, "reserve_output_tokens", 50),
            )
            for prompt in prompts
        ]

        processed_prompts[category] = truncated_prompts

    return processed_prompts
@@ -0,0 +1,12 @@
{
  "model_id": "meta-llama/Llama-3.2-3B",
  "dtype": "float16",
  "seed": 42,
  "num_inference_runs": 10,
  "max_new_tokens": 20,
  "category_generation_params": {
    "short": {"max_new_tokens": 20},
    "medium": {"max_new_tokens": 50},
    "long": {"max_new_tokens": 100}
  }
}
@@ -0,0 +1,17 @@
{
  "base_model_name_or_path": null,
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": false,
  "init_lora_weights": true,
  "lora_alpha": 16,
  "lora_dropout": 0.1,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 8,
  "target_modules": [
    "q_proj",
    "v_proj"
  ],
  "task_type": "CAUSAL_LM"
}