
Quantization Memory Requirements #1228

@sneha5gsm

Description


Hello!

I was trying out the various quantization recipes for quantizing a 70B Llama-3-based model to FP8, INT8, and INT4 (A16) precisions, as described in the vLLM quantization docs.

  1. Could you help me understand the memory requirements for the quantization recipes, i.e. SmoothQuant (SmoothQuantModifier), GPTQ (GPTQModifier), and RTN (QuantizationModifier)? A calculation/formula would help, for example like the one we have for calculating the KV cache:
memory in bytes for KV cache = 80 (layers) * 8 (kv heads) * 128 (head_dim) * 8192 (seq length) * 2 (K and V) * 2 (bytes per fp16 value)

I understand that calculate_offload_device_map creates a custom device map that reserves memory for GPTQ (reserve_for_hessians), but I would still like to understand the memory requirements so that I can use GPU memory efficiently, know where all the GPU memory is consumed, and make sure there are no bugs. A rough sketch of the kind of estimate I mean follows below.
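
For concreteness, here is the kind of back-of-envelope estimate I have in mind, as a small Python sketch. The KV cache function just reproduces the formula above; the Hessian function assumes (my assumption, not something from the docs) that GPTQ keeps one fp32 Hessian of shape (in_features, in_features) per Linear layer being quantized, so the actual reservation made by reserve_for_hessians may differ.

```python
# Back-of-envelope estimates only; the Hessian sizing is an assumption about
# how GPTQ's reserve_for_hessians works, not a confirmed implementation detail.

def kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192, bytes_per_elem=2):
    # layers * kv_heads * head_dim * seq_len * 2 (K and V) * dtype size (fp16 = 2 bytes)
    return layers * kv_heads * head_dim * seq_len * 2 * bytes_per_elem

def gptq_hessian_bytes(in_features, bytes_per_elem=4):
    # Assumed: one fp32 Hessian of shape (in_features, in_features) per quantized Linear.
    return in_features * in_features * bytes_per_elem

gib = 1024 ** 3
print(f"KV cache (example above): {kv_cache_bytes() / gib:.2f} GiB")  # ~2.5 GiB
# 28672 is the MLP intermediate size of Llama-3-70B, i.e. the widest Linear input dim.
print(f"Hessian for down_proj:    {gptq_hessian_bytes(28672) / gib:.2f} GiB per layer")  # ~3.1 GiB
```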

  2. Also, I understand that, for quantization of big models, the model is currently split in a pipeline-parallel way across the GPUs available on the instance (roughly as in the sketch after this list).
  • Since the GPU in use at any given time is the one holding the layer currently being quantized, would the time taken to quantize the model on multiple GPUs be similar to quantizing it on a single GPU?
  • Is it possible to split the model in a tensor-parallel way instead?
  • I understand that 'non-sequential GPTQ' is deprecated, but how much memory does non-sequential GPTQ require? I think the memory calculation above would help here as well. Also, how much speed-up would we see with the non-sequential approach compared to the sequential one?
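
For reference, this is roughly how I am building the pipeline-style split today. It is only a sketch: it assumes calculate_offload_device_map lives in llmcompressor.transformers.compression.helpers and takes reserve_for_hessians/num_gpus/torch_dtype as in the llm-compressor big-model examples, which may differ across versions; MODEL_ID and num_gpus=4 are placeholders.

```python
# Sketch only: module path and signature of calculate_offload_device_map are
# assumed from the llm-compressor big-model examples and may vary by version.
import torch
from transformers import AutoModelForCausalLM
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder model stub

# Builds a layer-by-layer (pipeline-style) device map across num_gpus devices,
# reserving extra room on each GPU for the GPTQ Hessians.
device_map = calculate_offload_device_map(
    MODEL_ID,
    reserve_for_hessians=True,
    num_gpus=4,
    torch_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    torch_dtype=torch.bfloat16,
)
```

Each GPU then holds a contiguous slice of decoder layers, which is the pipeline-style placement my questions above refer to.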

Thank you!

Metadata

Labels: question (Further information is requested)
