Add: save_pretrained readme

rahul-tuli · rahul-tuli · commit 4a2ae1d4f5a3 · 2025-04-23T16:11:58.000-04:00
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
diff --git a/docs/save_pretrained.md b/docs/save_pretrained.md
@@ -0,0 +1,107 @@
+# Enhanced `save_pretrained` Arguments
+
+The `llmcompressor` library extends Hugging Face's `save_pretrained` method with additional arguments to support model compression functionality. This document explains these extra arguments and how to use them effectively.
+
+## How It Works
+
+`llmcompressor` wraps the model's original `save_pretrained` method with an enhanced version that supports compression. The wrapping happens in one of two ways:
+
+1. **Direct modification**: When you call `modify_save_pretrained(model)` directly
+2. **Automatic wrapping**: When you call `oneshot(...)`, which wraps `save_pretrained` under the hood
+
+This means that after applying compression with `oneshot`, your model's `save_pretrained` method is already enhanced with compression capabilities, and you can use the additional arguments described below.
+
+## Additional Arguments
+
+When saving your compressed models, you can use the following extra arguments with the `save_pretrained` method:
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `sparsity_config` | `Optional[SparsityCompressionConfig]` | `None` | Optional configuration for sparsity compression. If `None` and `skip_sparsity_compression_stats` is `False`, the configuration is inferred automatically from the model. |
+| `quantization_format` | `Optional[str]` | `None` | Optional format string for quantization. If not provided, it will be inferred from the model. |
+| `save_compressed` | `bool` | `True` | Controls whether to save the model in a compressed format. Set to `False` to save in the original dense format. |
+| `skip_sparsity_compression_stats` | `bool` | `True` | Controls whether to skip calculating sparsity statistics (e.g., global sparsity and structure) when saving the model. Set to `False` to include these statistics. |
+| `disable_sparse_compression` | `bool` | `False` | When set to `True`, skips any sparse compression during save, even if the model has been previously compressed. |
+
+## Examples
+
+### Applying Compression with oneshot
+
+The simplest approach is to use `oneshot`, which handles both compression and wrapping `save_pretrained`:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import GPTQModifier
+
+# Load model
+model = AutoModelForCausalLM.from_pretrained("your-model")
+tokenizer = AutoTokenizer.from_pretrained("your-model")
+
+# Apply compression - this also wraps save_pretrained
+oneshot(
+    model=model,
+    recipe=[GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])],
+    # Other oneshot parameters...
+)
+
+# Now you can use the enhanced save_pretrained
+SAVE_DIR = "your-model-W8A8-compressed"
+model.save_pretrained(
+    SAVE_DIR,
+    save_compressed=True  # Use the enhanced functionality
+)
+tokenizer.save_pretrained(SAVE_DIR)
+```
+
+### Manual Approach (Without oneshot)
+
+If you need more control, you can wrap `save_pretrained` manually:
+
+```python
+from transformers import AutoModelForCausalLM
+from llmcompressor.transformers.sparsification import modify_save_pretrained
+
+# Load model
+model = AutoModelForCausalLM.from_pretrained("your-model")
+
+# Manually wrap save_pretrained
+modify_save_pretrained(model)
+
+# Now you can use the enhanced save_pretrained
+model.save_pretrained(
+    "your-model-path",
+    save_compressed=True,
+    skip_sparsity_compression_stats=False,  # compute sparsity stats so the sparsity config can be inferred
+)
+```
+
+### Saving with Custom Sparsity Configuration
+
+```python
+from compressed_tensors.sparsification import SparsityCompressionConfig
+
+# Create custom sparsity config
+custom_config = SparsityCompressionConfig(
+    format="2:4",
+    block_size=16
+)
+
+# Save with custom config (assumes `model` already has the enhanced save_pretrained)
+model.save_pretrained(
+    "your-model-custom-sparse",
+    sparsity_config=custom_config,
+)
+```
+
+## Notes
+
+- When loading compressed models with `from_pretrained`, the compression format is automatically detected.
+- To use compressed models with vLLM, simply load them as you would any model:
+  ```python
+  from vllm import LLM
+  model = LLM("./your-model-compressed")
+  ```
+- Compression configurations are saved in the model's config file and are automatically applied when loading.
+
+For more information about compression algorithms and formats, please refer to the documentation and examples in the llmcompressor repository.
diff --git a/src/llmcompressor/transformers/sparsification/compressed_tensors_utils.py b/src/llmcompressor/transformers/sparsification/compressed_tensors_utils.py
@@ -43,6 +43,9 @@ def modify_save_pretrained(model: PreTrainedModel) -> None:
     2. Saves the recipe, appending any current recipes to existing recipe files
     3. Copies any necessary python files from the model cache
 
+    For more information on the compression parameters and model saving in
+    llmcompressor, refer to docs/save_pretrained.md
+
     :param model: The model whose save_pretrained method will be modified
     """
     original = model.save_pretrained
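
The hunk above ends where `modify_save_pretrained` captures the original bound method (`original = model.save_pretrained`). As a rough illustration of that wrapping pattern, here is a minimal sketch in plain Python. It is not the llmcompressor implementation: `ToyModel`, `wrap_save_pretrained`, and the returned strings are hypothetical stand-ins, and the only argument borrowed from the new doc is `save_compressed`.

```python
import functools


class ToyModel:
    """Hypothetical stand-in for a Hugging Face model (not a transformers class)."""

    def save_pretrained(self, save_directory, **kwargs):
        # A real model would write weights to disk; the sketch just reports it.
        return f"saved dense weights to {save_directory}"


def wrap_save_pretrained(model):
    """Replace model.save_pretrained with a wrapper that accepts an extra argument."""
    original = model.save_pretrained  # bound method captured before patching

    @functools.wraps(original)
    def save_pretrained_wrapper(save_directory, save_compressed=True, **kwargs):
        # Delegate to the original method, then apply a placeholder for the
        # compression step (assumption: real compression happens around here).
        result = original(save_directory, **kwargs)
        if save_compressed:
            result += " (compressed)"
        return result

    # Assigning on the instance leaves the class, and other instances, untouched.
    model.save_pretrained = save_pretrained_wrapper


model = ToyModel()
wrap_save_pretrained(model)
message = model.save_pretrained("out-dir", save_compressed=True)
dense_message = model.save_pretrained("out-dir", save_compressed=False)
```

Because the wrapper is attached to the instance rather than the class, a freshly constructed `ToyModel` still has the unmodified method, which mirrors the doc's point that a model must go through `modify_save_pretrained` or `oneshot` before the extra arguments are accepted.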