
Conversation

Kaihui-intel (Contributor):

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "opensourcerelease/DeepSeek-R1-bf16"
# model_name = "/data1/DeepSeek-R1-bf16"  # or point at a local copy of the checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=False, torch_dtype="auto")

from auto_round import AutoRound

autoround = AutoRound(
    model=model,
    tokenizer=tokenizer,
    nsamples=512,
    batch_size=4,
    low_gpu_mem_usage=False,
    device_map="auto",
    seqlen=2048,
)
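
The snippet stops after construction; assuming the standard AutoRound quantize()/save_quantized() API, the usual follow-up would be (output directory is a placeholder):

autoround.quantize()
autoround.save_quantized("./DeepSeek-R1-bf16-int4", format="auto_round")  # placeholder output path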

# Calculate all block linear memory except for the second modulelist
total_linear_memory = 0
for n, m in model.named_modules():
    if hasattr(type(m), "__name__") and "ModuleList" in type(m).__name__:
Contributor:

Call get_block_names here instead of matching ModuleList by class name.

for n, m in model.named_modules():
    if hasattr(type(m), "__name__") and "ModuleList" in type(m).__name__:
        for name, module in m[-1].named_modules():
            if isinstance(module, torch.nn.Linear):
Contributor:

Conv1D is also supported, so please handle it here as well.

"""
total_memory = bytes_to_gigabytes(torch.cuda.get_device_properties(i).total_memory)
reserved_memory = bytes_to_gigabytes(torch.cuda.memory_reserved(i))
allocated_memory = bytes_to_gigabytes(torch.cuda.memory_allocated(i))
Contributor:

It would be better to support XPU too. For now, you could raise an exception stating that XPU does not support device_map="auto".
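
A minimal sketch of such an interim guard (the helper name and exact condition are assumptions, not the PR's code):

import torch

def _check_device_map_auto_support(device: str) -> None:
    # Interim guard: reject device_map="auto" when the target is XPU (or only XPU is available).
    if "xpu" in str(device) or (hasattr(torch, "xpu") and torch.xpu.is_available() and not torch.cuda.is_available()):
        raise NotImplementedError('device_map="auto" is not yet supported on XPU; please pass an explicit device map.')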

all_blocks = get_block_names(model)
m = get_module(model, all_blocks[0][-1])
for name, module in m.named_modules():
    if isinstance(module, (torch.nn.Linear, transformers.pytorch_utils.Conv1D)):
Contributor:

Use the supported-types constant (SUPPORTED_LAYER_TYPES) instead of listing the types inline.

param_size = (
    sum(p.numel() for p in module.parameters()) * module.weight.element_size()
)  # element_size() gives the per-parameter byte width, so no float32 assumption is needed
block_memory += param_size
block_memory = block_memory / 1024**3  # bytes -> GiB
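
Taken together, the per-block estimate under review amounts to roughly the following sketch (the helper name is made up for illustration):

import torch
from transformers.pytorch_utils import Conv1D

def _estimate_block_param_mem_gb(block) -> float:
    # Sum the parameter bytes of the quantizable layers in one block; element_size()
    # handles bf16 and fp32 checkpoints without hard-coding a byte width.
    total_bytes = 0
    for module in block.modules():
        if isinstance(module, (torch.nn.Linear, Conv1D)):
            total_bytes += sum(p.numel() * p.element_size() for p in module.parameters())
    return total_bytes / 1024**3
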
wenhuach21 (Contributor), Sep 2, 2025:

For VLMs, different blocks may need different amounts of memory. Why not move this code into the quant_blocks function?
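
A rough per-block variant of that idea, reusing the helper sketched above together with get_block_names/get_module from the diff (names as they appear there):

block_names = get_block_names(model)[0]
block_mems_gb = {name: _estimate_block_param_mem_gb(get_module(model, name)) for name in block_names}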

Signed-off-by: Kaihui-intel <[email protected]>
@@ -217,6 +216,7 @@ def __init__(
disable_deterministic_algorithms = kwargs.pop("disable_deterministic_algorithms", False)
static_kv_dtype = kwargs.pop("static_kv_dtype", None)
self.vlm = kwargs.pop("vlm") if "vlm" in kwargs else False
self.mem_expansion_factor = kwargs.pop("mem_expansion_factor", None)
Contributor:

How about ram_per_param_scale? It would also be better to add a comment explaining what this variable means.
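
A sketch of what the rename plus explanatory comment could look like (wording and default are placeholders, not the PR's final code):

# ram_per_param_scale: multiplier applied to a layer's parameter memory to approximate the
# peak memory needed while tuning it (weights plus activations and temporary buffers).
self.ram_per_param_scale = kwargs.pop("ram_per_param_scale", None)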

"""Automatically sets the device map for the model based on available GPUs and memory constraints."""
num_gpus = torch.cuda.device_count() - 1
if num_gpus == 0:
def get_block_info(self, block, input_ids, supported_types=SUPPORTED_LAYER_TYPES) -> tuple[float, float]:
Contributor:

Could you suggest a more precise name, preferably one that includes ‘mem’?

tensors of the first block, assuming bfloat16 or float32 precision.
"""
# Calculate all block linear memory
total_linear_memory = 0
Contributor:

total_param_mem?

if self.low_gpu_mem_usage:
    return block_memory, 0

# assuming bfloat16 or float32, input and output
Contributor:

Start the comment with an upper-case letter.

device_memory[cuda_devices[device_idx]] -= layer_memory * mem_expansion_factor
if device_idx >= len(cuda_devices):
    raise ValueError(
        f"model is too large to fit in {num_gpus} GPUs, "
Contributor:

As discussed, for device 0 we use mem_expansion_factor; for the other devices we just split the remaining parameters. If a device's share exceeds layer_memory * mem_expansion_factor, log a warning rather than raising an exception.
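
A self-contained sketch of one possible reading of this strategy (the function name and signature are illustrative, not the PR's code):

def assign_layers_to_devices(layer_mems_gb, device_free_gb, mem_expansion_factor, logger):
    # Fill device 0 under the scaled estimate, split whatever is left across the other
    # devices round-robin, and only warn (never raise) when a layer's scaled estimate
    # exceeds a device's free memory.
    scaled = {name: mem * mem_expansion_factor for name, mem in layer_mems_gb.items()}
    device_map, used0, leftover = {}, 0.0, []
    for name in layer_mems_gb:
        if used0 + scaled[name] <= device_free_gb[0]:
            device_map[name] = 0
            used0 += scaled[name]
        else:
            leftover.append(name)
    others = max(len(device_free_gb) - 1, 1)
    for i, name in enumerate(leftover):
        dev = 1 + i % others if len(device_free_gb) > 1 else 0
        if scaled[name] > device_free_gb[dev]:
            logger.warning(f"Layer {name} may not fit on device {dev}; tuning may OOM.")
        device_map[name] = dev
    return device_map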

if self.device_map == "auto":
    self.set_auto_device_map_in_block(block, input_ids)

if self.device_map is not None:
    from accelerate import dispatch_model
Contributor:

Please remember to also support this in the CLI scenario: auto-round --model xxx --devices 0,1,2

Signed-off-by: Kaihui-intel <[email protected]>
@@ -506,39 +507,34 @@ def _set_device_for_matching_module(self, name: str, device: str) -> None:
else:
    module.tuning_device = device

-    def get_block_info(self, block, input_ids, supported_types=SUPPORTED_LAYER_TYPES) -> tuple[float, float]:
+    def get_block_mem(self, block, input_ids, supported_types=SUPPORTED_LAYER_TYPES) -> tuple[float, float]:
Contributor:

estimate_tuning_block_mem, predict_tuning_block_mem, or something along those lines.

logger.warning(
    f"Layer {layer_name} may not fit in available GPU memory. "
    "Consider lowering ram_per_param_scale, using more GPUs, "
    "or reducing model size."
)
Contributor:

Remove "reducing model size" from the message.

wenhuach21 (Contributor), Sep 4, 2025:

Consider using more GPUs or reducing mem_per_param_scale if OOM occurs.

Besides, you need to add a mem_per_param_scale argument in llm.py.
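
Combining the two wording suggestions, the warning might end up roughly as follows (a sketch, not the merged text):

logger.warning(
    f"Layer {layer_name} may not fit in available GPU memory. "
    "Consider using more GPUs or reducing mem_per_param_scale if OOM occurs."
)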

    device_map[layer_name] = device_idx
    device_memory[device_idx] -= layer_memory * ram_per_param_scale
else:
    logger.warning(
Contributor:

better to use warning_once?

if self.low_gpu_mem_usage:
    return block_memory, 0

-    # assuming bfloat16 or float32, input and output
+    # Assuming bfloat16 or float32, input and output
input_bytes = 2 if self.amp_dtype != torch.float32 else 4
wenhuach21 (Contributor), Sep 4, 2025:

input_ids[0] should already have a dtype; read it from there instead of assuming.
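
For example, something along these lines (assuming input_ids is the list of cached input tensors, as elsewhere in the file):

input_bytes = input_ids[0].element_size()  # byte width taken from the cached tensor itself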

Signed-off-by: Kaihui-intel <[email protected]>
@@ -2460,6 +2549,10 @@ def _quantize_block(
new_layer = convert_fp8_layer_to_linear(m, self.amp_dtype).to(device)
set_module(block, n, new_layer)

if self.device_map == "auto":
    self.set_auto_device_map_in_block(block, input_ids)
Contributor:

Should this be private, i.e. _set_auto_device_map_in_block?


def set_auto_device_map_in_block(self, block, input_ids, supported_types=SUPPORTED_LAYER_TYPES) -> None:
    """Automatically sets the device map for the block based on available GPUs and memory constraints."""
    num_gpus = torch.cuda.device_count()
Contributor:

Better to check whether the device is CUDA; if it is a device like XPU, we should log a warning and try to use device 0.
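
A sketch of the suggested fallback (illustrative only; assumes torch and the module logger are already in scope):

if torch.cuda.is_available():
    num_devices = torch.cuda.device_count()
else:
    logger.warning('device_map="auto" currently assumes CUDA; falling back to device 0.')
    num_devices = 1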

Signed-off-by: Kaihui-intel <[email protected]>
Signed-off-by: Kaihui-intel <[email protected]>