Conversation

@kylesayrs kylesayrs commented Jun 28, 2025

Purpose

  • Support saving models with transforms attached

Prerequisites

Semi-Prerequisites

These changes are required to support saving models with offloaded Transforms; models without offloading do not need them.

Changes

  • Implement `_update_tied_weights`, which:
    • Updates the `_dynamic_tied_weights_keys` attribute of the transform modules. This attribute is read by transformers during saving and tells it to delete duplicates of the tied weights before writing the checkpoint.
    • Sets the meta tensors of shared weights to be identical objects, so they can be recognized and deleted by transformers (see the sketch after this list).
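
A rough sketch of the logic (illustrative only, not the PR's exact implementation; the `self.transforms` registry, the offload access via `_hf_hook`, and `_dynamic_tied_weights_keys` being a set are assumptions):

```python
# Illustrative sketch of the tied-weight bookkeeping described above.
from collections import defaultdict


def _update_tied_weights(self):
    # group transform parameters by the storage of their underlying data,
    # reading through the offload map when a module only holds meta tensors
    ptr_to_keys = defaultdict(list)
    for transform in self.transforms:  # assumed registry of transform modules
        for name, param in transform.named_parameters(recurse=False):
            if hasattr(transform, "_hf_hook"):  # offloaded module (assumption)
                param = transform._hf_hook.weights_map[name]
            ptr_to_keys[param.data_ptr()].append((transform, name))

    # for each group of shared weights, record the key (read by transformers
    # at save time) and point every module at literally the same tensor so
    # the duplicates are recognized and dropped from the checkpoint
    for shared in ptr_to_keys.values():
        if len(shared) > 1:
            first_module, first_name = shared[0]
            tensor = getattr(first_module, first_name)
            for transform, name in shared:
                transform._dynamic_tied_weights_keys.add(name)
                setattr(transform, name, tensor)
```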

Testing

  • Add serialization tests (see the sketch below)
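
A hedged sketch of such a round-trip test, assuming the `TransformConfig` / `TransformScheme` / `TransformArgs` / `apply_transform_config` names from the prerequisite transform-apply work and using a placeholder model id:

```python
# Hedged sketch of a save/reload round trip; the transform API names are
# assumed from the prerequisite transform-apply PR, and "<small-test-model>"
# is a placeholder for any small causal-LM checkpoint.
from transformers import AutoModelForCausalLM

from compressed_tensors.transform import (
    TransformArgs,
    TransformConfig,
    TransformScheme,
    apply_transform_config,
)


def test_save_reload_with_transforms(tmp_path):
    model = AutoModelForCausalLM.from_pretrained("<small-test-model>")

    # attach hadamard transforms to the weight inputs of every Linear layer
    config = TransformConfig(
        config_groups={
            "u": TransformScheme(
                type="hadamard",
                apply=[TransformArgs(targets=["Linear"], location="weight_input")],
            )
        }
    )
    apply_transform_config(model, config)

    # saving should deduplicate tied transform weights (via
    # `_dynamic_tied_weights_keys`) instead of erroring or writing duplicates
    model.save_pretrained(tmp_path)
    reloaded = AutoModelForCausalLM.from_pretrained(tmp_path)
    assert sum(p.numel() for p in reloaded.parameters()) > 0
```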

kylesayrs added 30 commits May 30, 2025 13:40
kylesayrs added 4 commits July 8, 2025 15:42
@kylesayrs kylesayrs force-pushed the kylesayrs/transform_save branch from 49e04b9 to 2e362d2 on July 8, 2025 23:16
Base automatically changed from kylesayrs/transform_apply to main July 9, 2025 22:32
@dsikka dsikka dismissed brian-dellabetta’s stale review July 9, 2025 22:32

The base branch was changed.

@brian-dellabetta brian-dellabetta left a comment

Looks like your style checks have different behavior than what's on main. I've seen this elsewhere, not sure what's causing it. Maybe our version pin on flake8>=3.8.3 is too loose and later versions have different behavior?
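
If the loose pin turns out to be the cause, one option is to cap it in the dev requirements; the upper bound below is a hypothetical value, not something that has been verified:

```python
# Hypothetical tightening of the dev dependency in setup.py; the upper bound
# is an illustrative assumption, not a tested value.
_dev_requirements = [
    "flake8>=3.8.3,<7.0",  # cap flake8 so new releases can't change lint results
]
```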

@brian-dellabetta brian-dellabetta left a comment

This and #391 are both ready to merge, right? We can send them over to the team for review if so.

@kylesayrs

@brian-dellabetta This is ready to merge. Afaict the only thing left from the head branch is applying transforms at a higher granularity, and it's still unknown whether that actually has an effect on accuracy.

@brian-dellabetta brian-dellabetta left a comment

LGTM!

@brian-dellabetta brian-dellabetta left a comment

👍

@dsikka dsikka merged commit b2df366 into main Aug 7, 2025
1 check passed
@dsikka dsikka deleted the kylesayrs/transform_save branch August 7, 2025 01:12
brian-dellabetta added a commit to vllm-project/llm-compressor that referenced this pull request Aug 13, 2025
## Purpose ##
* Enable offline spinquant-style transforms

## Prerequisites ##
* vllm-project/compressed-tensors#370
* vllm-project/compressed-tensors#412
* vllm-project/compressed-tensors#414

## Changes ##
* Added `spinquant_example.py` to examples folder
* Added `SpinQuantModifier` which handles the construction of a
spinquant-style transform config

## Testing ##
* Added modifier serialization and correctness tests

## Evaluation ##
Using this branch, and [the original SpinQuant
code](https://github.com/facebookresearch/SpinQuant), we see very
similar results for `meta-llama/Llama-3.2-1B-Instruct` with W4A16
quantization. Results are equivalent in hf (in-memory vs serialized and
re-loaded), and very similar in vllm. The symmetric scales calculation in
`llm-compressor` differs slightly from the original SpinQuant paper, which
uses the original GPTQ implementation. When the SpinQuant-style scales are
swapped in, results are consistent, with hadamard improving results on
`gsm8k_llama` and `arc_challenge_llama`:

Scheme | Impl | gsm8k | gsm8k_llama | arc_challenge_llama
-- | -- | -- | -- | --
Hadamard+W4A16 | LC | 0.2403 | 0.2835 | 0.5262
W4A16 | LC | 0.1964 | 0.1933 | 0.4781
Hadamard+W4A16 | LC+SQscales | 0.1721 | 0.2183 | 0.485
W4A16 | LC+SQscales | 0.207 | 0.1706 | 0.4498
Hadamard+W4A16 | SQ | 0.1736 | 0.2282 | 0.4807
W4A16 | SQ | 0.1986 | 0.1774 | 0.4489

To run LC+SQScales, change [this line in
CT](https://github.com/neuralmagic/compressed-tensors/blob/b2df366797b00330ec765f5891dde14e4cc74c9d/src/compressed_tensors/quantization/utils/helpers.py#L111)
from

```python
scales = max_val_pos / (float(bit_range) / 2)
```
to
```python
scales = max_val_pos / (float(bit_max))
```
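
For reference, with 4-bit symmetric weights (assuming the usual int4 range of `bit_min = -8`, `bit_max = 7`, so `bit_range = 15`) the two variants differ only in the divisor:

```python
# Numeric illustration of the two scale formulas for int4 symmetric weights
# (bit_min = -8, bit_max = 7 assumed, so bit_range = 15)
max_val_pos = 1.0
lc_scale = max_val_pos / (15 / 2)  # ~0.1333, llm-compressor default
sq_scale = max_val_pos / 7         # ~0.1429, SpinQuant / GPTQ-style divisor
```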

<details>
<summary>The following python script was used to generate these
results</summary>

Clone SpinQuant repo and paste this in the top-level directory:
```python
# coding=utf-8
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import torch
from typing import Literal
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from torch import nn
import lm_eval

from transformers import LlamaForCausalLM, AutoTokenizer
import transformers
from train_utils.main import prepare_model
from train_utils.modeling_llama_quant import LlamaForCausalLM as LlamaForCausalLMQuant
from utils.hadamard_utils import random_hadamard_matrix, hadamard_matrix
from utils.process_args import process_args_ptq

# model_id = "meta-llama/Llama-3.1-8B-Instruct"
# model_id = "meta-llama/Llama-3.2-3B-Instruct"
model_id = "meta-llama/Llama-3.2-1B-Instruct"
dtype = torch.bfloat16


class RotateModule(nn.Module):
    def __init__(self, R_init):
        super(RotateModule, self).__init__()
        self.weight = nn.Parameter(R_init.to(torch.float32).to(torch.device("cuda")))

    def forward(self, x, transpose=False):
        if transpose:
            return x @ self.weight
        else:
            return self.weight @ x


def get_sq_model(
    r1r2: Literal["eye", "random-hadamard", "hadamard"],
    w_bits: Literal[4, 16],
    w_clip: bool = False,
) -> LlamaForCausalLMQuant:
    model_args, training_args, ptq_args = process_args_ptq()
    model_args.input_model = model_id
    if w_bits == 4:
        ptq_args.w_bits = 4
        ptq_args.w_groupsize = 128
        ptq_args.w_rtn = True  # if False, GPTQ is used
        ptq_args.w_clip = w_clip
    ptq_args.a_bits = 16
    ptq_args.k_bits = 16
    ptq_args.v_bits = 16

    print("=======ARGS=======", ptq_args)

    config = transformers.AutoConfig.from_pretrained(model_args.input_model)

    # Llama v3.2 specific: SpinQuant is not compatible with tie_word_embeddings; clone lm_head from embed_tokens
    process_word_embeddings = False
    if config.tie_word_embeddings:
        config.tie_word_embeddings = False
        process_word_embeddings = True

    model = LlamaForCausalLMQuant.from_pretrained(
        pretrained_model_name_or_path=model_args.input_model,
        config=config,
        torch_dtype=dtype,
        device_map="cuda",
    )

    if process_word_embeddings:
        model.lm_head.weight.data = model.model.embed_tokens.weight.data.clone()

    model = prepare_model(ptq_args, model)
    for param in model.parameters():
        param.requires_grad = False
    match r1r2:
        case "eye":
            R1 = torch.eye(model.config.hidden_size, device="cuda")
        case "random-hadamard":
            R1 = random_hadamard_matrix(model.config.hidden_size, "cuda")
        case _:
            R1 = hadamard_matrix(model.config.hidden_size, "cuda")
    model.R1 = RotateModule(R1)
    for i in range(model.config.num_hidden_layers):
        # Each head dim = 128 for Llama model
        match r1r2:
            case "eye":
                R2 = torch.eye(
                    model.config.hidden_size // model.config.num_attention_heads,
                    device="cuda",
                )
            case "random-hadamard":
                R2 = random_hadamard_matrix(
                    model.config.hidden_size // model.config.num_attention_heads, "cuda"
                )
            case _:
                R2 = hadamard_matrix(
                    model.config.hidden_size // model.config.num_attention_heads, "cuda"
                )
        model.model.layers[i].self_attn.R2 = RotateModule(R2)

    model.config.use_cache = False

    return model


def get_lc_model(
    r1r2: Literal["eye", "random-hadamard", "hadamard"], w_bits: Literal[4, 16]
) -> LlamaForCausalLM:
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.modifiers.transform import SpinQuantModifier

    model = LlamaForCausalLM.from_pretrained(
        pretrained_model_name_or_path=model_id,
        torch_dtype=dtype,
        device_map="cuda",
    )

    recipe = [
        SpinQuantModifier(
            rotations=[] if r1r2 == "eye" else ["R1", "R2"],
            transform_type="hadamard",
        )
    ]
    if w_bits == 4:
        recipe.append(
            QuantizationModifier(
                targets="Linear",
                scheme="W4A16",
                ignore=["lm_head"],
            )
        )

    oneshot(
        model=model,
        recipe=recipe,
        pipeline="datafree",
        log_dir=None,
    )

    return model


if __name__ == "__main__":
    for scales_impl in ["sq_min_hack", "lc_min_hack"]:
        for r1r2 in ["eye", "hadamard"]:
            for sq_lc in ["sq", "lc"]:
                w_bits = 4

                os.environ["SCALES_IMPL"] = scales_impl

                model = (
                    get_sq_model(r1r2=r1r2, w_bits=w_bits)
                    if sq_lc == "sq"
                    else get_lc_model(r1r2=r1r2, w_bits=w_bits)
                ).to("cuda")

                SAVE_DIR = model_id.split("/")[1] + f"-{scales_impl}-{r1r2}-w4a16"
                model.save_pretrained(SAVE_DIR, save_compressed=True)
                tokenizer = AutoTokenizer.from_pretrained(
                    model_id, trust_remote_code=True
                )
                tokenizer.save_pretrained(SAVE_DIR)

                del model
                del tokenizer
                torch.cuda.empty_cache()

                results = lm_eval.simple_evaluate(
                    # 1) hf in-memory
                    # model=lm_eval.models.huggingface.HFLM(
                    #     pretrained=model,
                    #     batch_size=32,
                    #     add_bos_token=False,
                    # ),
                    # 1/)
                    # 2) vllm serialized
                    model="vllm",
                    model_args={
                        "pretrained": SAVE_DIR,
                        "add_bos_token": False,
                        "dtype": "auto",
                        "max_model_len": 4096,
                        "gpu_memory_utilization": 0.5,
                        "enable_chunked_prefill": True,
                    },
                    # 2/)
                    # 3) hf serialized
                    # model="hf",
                    # model_args={
                    #     "pretrained": SAVE_DIR,
                    #     "add_bos_token": False,
                    #     "dtype": "auto",
                    # },
                    # device="cuda",
                    # 3/)
                    tasks=["gsm8k_llama", "gsm8k", "arc_challenge_llama"],
                    num_fewshot=8,
                    batch_size=32,
                    apply_chat_template=True,
                    fewshot_as_multiturn=True,
                )
                print(
                    f"RESULTS, {model_id} {sq_lc} R1R2 {r1r2} W_BITS {w_bits} SCALEIMPL {scales_impl}"
                )
                print(lm_eval.utils.make_table(results))
```
</details>


## Follow Ups ##
* Infer data free pipeline, even if a transform modifier is included
* Rotations R3 and R4
* Modify example to use GPTQ once basic evaluation has been performed

---------

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
dsikka added a commit to vllm-project/llm-compressor that referenced this pull request Aug 14, 2025
## Purpose ##
* Enable quip-style transforms

## Prerequisites ##
* vllm-project/compressed-tensors#370
* vllm-project/compressed-tensors#412
* vllm-project/compressed-tensors#414

## Changes ##
* Added `quip_example.py` to examples folder
* As made clear in the disclaimer, this example requires minimum
versions of compressed-tensors and transformers to run
* Added `QuIPModifier` which handles the construction of a quip-style
transform config (a minimal usage sketch is shown below)
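
A minimal usage sketch, modeled on the SpinQuant flow above; the exact `QuIPModifier` arguments (e.g. `transform_type`) are assumptions about the API rather than a verified excerpt from `quip_example.py`:

```python
# Minimal usage sketch modeled on the SpinQuant flow above; the QuIPModifier
# arguments are assumptions about the final API.
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier

model_id = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda"
)

recipe = [
    QuIPModifier(transform_type="random-hadamard"),  # quip-style rotations
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

# data-free pipeline: transforms plus RTN quantization, no calibration data
oneshot(model=model, recipe=recipe, pipeline="datafree")

model.save_pretrained(model_id.split("/")[1] + "-quip-w4a16", save_compressed=True)
```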

## Testing ##
* Added modifier serialization and correctness tests

## Evaluation ##
Evaluation performed by @brian-dellabetta 

Evals on Llama 3.2 1B with Quip (num_fewshot 8, limit 1000 to be
compatible with results
[here](https://github.com/vllm-project/llm-compressor/pull/1243/files#diff-bdc27f23c0dc2da352d5c83abdc0f267873edf4d36f88474038b975df75bd8c3R38-R64))
:

| Strat | gsm8k,strict | gsm8k_llama,strict |
|-|-|-|
| FP16 | .352 | .323 |
| Quip | .348 | .322 |
| W4A16 | .180 | .017 |
| Quip+W4A16 | .213 | .141 |

## Follow Ups ##
* Infer data free pipeline, even if a transform modifier is included
* Modify example to use GPTQ once basic evaluation has been performed

---------

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Etelis added a commit to Etelis/compressed-tensors that referenced this pull request Sep 11, 2025
* add utilities
* add tests
* add additional tests
* add utils and tests
* Implement transform factories
* add permutations
* add delete_offload_module
* key inverses by weight
* fix tests
* standardize random hadamard
* prepend input hooks
* apply sqrt division first
* use divided hadamards
* fix typo
* add random option
* use random seeds, rename matrix multiply
* add deterministic generation to random matrix
* fix perm math
* update docstrings
* update docstrings
* cleanup
* cleanup 2
* make seed optional
* remove iterable check and missing return value
* Remove unrelated changes
* simplify code
* implement apply, use in tests
* use hadamards database file
* try manifest
* try setup, update hadamards list
* fix setup
* add docstrings, cleanup
* fix setup, thank you @dbarbuzzi
* remove numpy, add tests
* solidify dtype, add gpu tests
* fix docstring
* add device option
* construct on execution device, cache on offload device
* save construction device changes for later
* construct on execution device, cache on offload device
* cite nja sloane
* remove dreg
* put on device via safe_open
* nits and docstrings
* update docstring
* Merge
* merge with construct: construct in float32
* construct with same dtype, constructing on fp32 found no difference
* remove unnecessary imports
* bugfixes (vllm-project#375)
* use factory_kwargs
* add frozen dict to deps
* fix style
* merge
* use delete_offload_module
* add docstring
* use parametrize
* populate _dynamic_tied_weights_keys
* ensure serializable
* remove extra space
* apply style
* merge dregs
* skip offloading tests until transformers changes land
* use set

---------

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Brian Dellabetta <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>