Support auto device mapping #781
base: main
Conversation
Signed-off-by: Kaihui-intel <[email protected]>
Signed-off-by: Kaihui-intel <[email protected]>
for more information, see https://pre-commit.ci
auto_round/utils.py
Outdated
# Calculate all block linear memory except for the second modulelist
total_linear_memory = 0
for n, m in model.named_modules():
    if hasattr(type(m), "__name__") and "ModuleList" in type(m).__name__:
call get_block_names
auto_round/utils.py
Outdated
for n, m in model.named_modules():
    if hasattr(type(m), "__name__") and "ModuleList" in type(m).__name__:
        for name, module in m[-1].named_modules():
            if isinstance(module, torch.nn.Linear):
Conv1D is also supported; please handle it here as well.
auto_round/utils.py
Outdated
""" | ||
total_memory = bytes_to_gigabytes(torch.cuda.get_device_properties(i).total_memory) | ||
reserved_memory = bytes_to_gigabytes(torch.cuda.memory_reserved(i)) | ||
allocated_memory = bytes_to_gigabytes(torch.cuda.memory_allocated(i)) |
It would be better to support XPU too. For now, you could raise an exception stating that XPU does not support device_map="auto".
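A minimal sketch of such a guard, assuming a helper placed where the device map is resolved (names here are illustrative, not the project's actual API):

import torch

def _check_auto_device_map_backend(device_map):
    # Sketch: device_map="auto" currently relies on torch.cuda memory queries,
    # so reject it on non-CUDA backends such as XPU for now.
    if device_map == "auto" and not torch.cuda.is_available():
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            raise NotImplementedError('device_map="auto" is not supported on XPU yet.')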
Signed-off-by: Kaihui-intel <[email protected]>
for more information, see https://pre-commit.ci
Signed-off-by: Kaihui-intel <[email protected]>
Signed-off-by: Kaihui-intel <[email protected]>
auto_round/utils.py
Outdated
all_blocks = get_block_names(model)
m = get_module(model, all_blocks[0][-1])
for name, module in m.named_modules():
    if isinstance(module, (torch.nn.Linear, transformers.pytorch_utils.Conv1D)):
Use the SUPPORTED_LAYER_TYPES constant here instead of listing the types.
auto_round/utils.py
Outdated
    sum(p.numel() for p in module.parameters()) * module.weight.element_size()
)  # Assuming parameters are float32 (4 bytes each)
block_memory += param_size
block_memory = block_memory / 1024**3
For VLMs, different blocks may need different amounts of memory. Why not port this code into the quant_blocks function?
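One way to read that suggestion is to compute the estimate inside the per-block loop instead of once up front. A rough sketch, with estimate_block_mem_gb as a hypothetical helper rather than the project's exact formula:

def estimate_block_mem_gb(block) -> float:
    # Sum parameter bytes of the block and convert to GiB.
    total_bytes = sum(p.numel() * p.element_size() for p in block.parameters())
    return total_bytes / 1024**3

# Inside a quant_blocks-style loop each block then gets its own estimate:
# for block_name in block_names:
#     block = get_module(model, block_name)
#     block_mem_gb = estimate_block_mem_gb(block)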
Signed-off-by: Kaihui-intel <[email protected]>
auto_round/autoround.py
Outdated
@@ -217,6 +216,7 @@ def __init__(
         disable_deterministic_algorithms = kwargs.pop("disable_deterministic_algorithms", False)
         static_kv_dtype = kwargs.pop("static_kv_dtype", None)
         self.vlm = kwargs.pop("vlm") if "vlm" in kwargs else False
+        self.mem_expansion_factor = kwargs.pop("mem_expansion_factor", None)
ram_per_param_scale? It would also be better to add a comment explaining what this variable means.
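The requested rename plus explanatory comment might look like this sketch (default value and wording are illustrative):

# Scale factor applied to a layer's parameter memory to estimate the peak RAM
# needed while tuning it (activations, gradients, cached inputs, etc.).
self.ram_per_param_scale = kwargs.pop("ram_per_param_scale", None)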
auto_round/autoround.py
Outdated
"""Automatically sets the device map for the model based on available GPUs and memory constraints.""" | ||
num_gpus = torch.cuda.device_count() - 1 | ||
if num_gpus == 0: | ||
def get_block_info(self, block, input_ids, supported_types=SUPPORTED_LAYER_TYPES) -> tuple[float, float]: |
Could you suggest a more precise name, preferably one that includes ‘mem’?
auto_round/autoround.py
Outdated
tensors of the first block, assuming bfloat16 or float32 precision.
"""
# Calculate all block linear memory
total_linear_memory = 0
total_param_mem?
auto_round/autoround.py
Outdated
if self.low_gpu_mem_usage:
    return block_memory, 0

# assuming bfloat16 or float32, input and output
Upper-case the start of this comment.
auto_round/autoround.py
Outdated
device_memory[cuda_devices[device_idx]] -= layer_memory * mem_expansion_factor
if device_idx >= len(cuda_devices):
    raise ValueError(
        f"model is too large to fit in {num_gpus} GPUs, "
As discussed: for device 0 we use mem_expansion_factor, and for the other devices we just split the remaining parameters. If the requirement exceeds layer_memory * mem_expansion_factor, log a warning rather than raising an exception.
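A sketch of that policy with hypothetical variable names: device 0 keeps the expansion-factor headroom, later devices simply share what is left, and an overflow only logs a warning.

for layer_name, layer_mem in layer_memories.items():
    placed = False
    for idx, dev in enumerate(cuda_devices):
        # Device 0 reserves headroom via mem_expansion_factor; later devices
        # just take a plain share of the remaining parameters.
        need = layer_mem * mem_expansion_factor if idx == 0 else layer_mem
        if device_memory[dev] >= need:
            device_map[layer_name] = dev
            device_memory[dev] -= need
            placed = True
            break
    if not placed:
        # Per the review: warn instead of raising when the estimate overflows.
        logger.warning(f"Layer {layer_name} may not fit in available GPU memory.")
        device_map[layer_name] = cuda_devices[-1]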
auto_round/autoround.py
Outdated
if self.device_map == "auto":
    self.set_auto_device_map_in_block(block, input_ids)

if self.device_map is not None:
    from accelerate import dispatch_model
Please remember to support this scenario as well: auto-round --model xxx --devices 0,1,2.
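For that scenario the CLI value has to be turned into the device list the auto map consumes; a tiny illustrative helper (not the existing CLI code):

def parse_devices_arg(devices: str) -> list[str]:
    # "0,1,2" -> ["cuda:0", "cuda:1", "cuda:2"]
    return [f"cuda:{int(d)}" for d in devices.split(",") if d.strip()]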
Signed-off-by: Kaihui-intel <[email protected]>
auto_round/autoround.py
Outdated
@@ -506,39 +507,34 @@ def _set_device_for_matching_module(self, name: str, device: str) -> None:
         else:
             module.tuning_device = device

-    def get_block_info(self, block, input_ids, supported_types=SUPPORTED_LAYER_TYPES) -> tuple[float, float]:
+    def get_block_mem(self, block, input_ids, supported_types=SUPPORTED_LAYER_TYPES) -> tuple[float, float]:
estimate_tuning_block_mem, predict_tuning_block_mem, or something like that.
auto_round/autoround.py
Outdated
logger.warning(
    f"Layer {layer_name} may not fit in available GPU memory. "
    "Consider lowering ram_per_param_scale, using more GPUs, "
    "or reducing model size."
Remove "reducing model size" from the warning message.
Consider using more GPUs or reducing mem_per_param_scale if OOM occurs.
Besides, you need to add a mem_per_param_scale argument in llm.py.
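Wiring that flag into llm.py could look roughly like this sketch (help text and default are illustrative):

parser.add_argument(
    "--mem_per_param_scale",
    type=float,
    default=None,
    help="Scale applied to a layer's parameter memory when estimating its tuning "
    "footprint; reduce it or use more GPUs if device_map='auto' runs out of memory.",
)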
auto_round/autoround.py
Outdated
device_map[layer_name] = device_idx
device_memory[device_idx] -= layer_memory * ram_per_param_scale
else:
    logger.warning(
better to use warning_once?
auto_round/autoround.py
Outdated
if self.low_gpu_mem_usage:
    return block_memory, 0

-# assuming bfloat16 or float32, input and output
+# Assuming bfloat16 or float32, input and output
input_bytes = 2 if self.amp_dtype != torch.float32 else 4
input_ids[0] should already have a dtype; use it instead of assuming bfloat16/float32.
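One way to follow the suggestion, assuming input_ids here is the list of cached input tensors for the block:

# Derive the byte width from the cached inputs instead of hard-coding 2 or 4.
input_bytes = input_ids[0].element_size() if len(input_ids) > 0 else 2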
Signed-off-by: Kaihui-intel <[email protected]>
auto_round/autoround.py
Outdated
@@ -2460,6 +2549,10 @@ def _quantize_block(
         new_layer = convert_fp8_layer_to_linear(m, self.amp_dtype).to(device)
         set_module(block, n, new_layer)

+        if self.device_map == "auto":
+            self.set_auto_device_map_in_block(block, input_ids)
_set?
auto_round/autoround.py
Outdated
|
def set_auto_device_map_in_block(self, block, input_ids, supported_types=SUPPORTED_LAYER_TYPES) -> None:
    """Automatically sets the device map for the block based on available GPUs and memory constraints."""
    num_gpus = torch.cuda.device_count()
Better to check whether the device is CUDA. If it is a device like XPU, we should log a warning and try to use device 0.
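A sketch of that check placed at the top of the method (the fallback assignment is illustrative):

if not torch.cuda.is_available():
    # Non-CUDA accelerators such as XPU are not sized by this logic yet:
    # warn and fall back to device 0 instead of building an auto map.
    logger.warning('device_map="auto" currently assumes CUDA; falling back to device 0.')
    self.device_map = 0
    return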
Signed-off-by: Kaihui-intel <[email protected]>
Signed-off-by: Kaihui-intel <[email protected]>
for more information, see https://pre-commit.ci