Support W4A8 method of AngleSlim tool #6857
base: main
Conversation
📝 Walkthrough

Adds Angelslim quantization config support, extends HF quant config parsing (new enums/fields/mappings), introduces ActivationScheme and exclude_quant_config in QuantConfig, adjusts excluded-module quant overrides, tightens DeepseekV3 kv_b_proj dequant condition, and overhauls fused MoE weight loading to be weight-name agnostic with FP8/FP4 per-expert scale handling and SM100-specific paths.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant ModelConfig
    participant FS as Filesystem
    participant Parser
    User->>ModelConfig: from_pretrained(model_dir)
    ModelConfig->>FS: check angelslim_hf_quant_config.json
    alt Angelslim config exists
        ModelConfig->>Parser: load_angelslim_quant_config(file)
        Parser-->>ModelConfig: QuantConfig, layer_quant_config=None
    else
        ModelConfig->>FS: check hf_quant_config.json
        ModelConfig->>Parser: load_hf_quant_config(file)
        Parser-->>ModelConfig: QuantConfig (incl. activation_scheme/exclude_quant_config)
    end
    ModelConfig-->>User: Model with QuantConfig
```

```mermaid
sequenceDiagram
    participant Caller
    participant FusedMoE
    participant Weights as Weights Dict
    participant Backend
    Caller->>FusedMoE: load_weights(module, weights, mode, weight_name)
    FusedMoE->>Weights: read expert.w{1,3,2}.{weight_name}
    alt FP8 QDQ path
        FusedMoE->>FusedMoE: compute per-expert scales/alphas (w1/w3/w2)
        FusedMoE->>FusedMoE: interleave/normalize scales
        opt SM100
            FusedMoE->>FusedMoE: resmooth weight_scale_inv into weight
        end
    end
    FusedMoE->>Backend: set tensors/scales
    FusedMoE-->>Caller: done
```
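For orientation, the "quantization" section that load_angelslim_quant_config reads could look roughly like the sketch below, shown as the parsed Python dict. The field names mirror the parser quoted later in this review; the concrete values (w4a8_awq, static, the module names) are illustrative assumptions, not taken from a real AngleSlim checkpoint.

```python
# Hypothetical parsed contents of angelslim_hf_quant_config.json["quantization"].
# Field names follow load_angelslim_quant_config; the values are examples only.
example_quantization_section = {
    "quant_algo": "w4a8_awq",          # upper-cased and mapped to a QuantAlgo member
    "kv_cache_quant_algo": None,       # optional KV-cache quantization algorithm
    "group_size": 128,                 # optional weight-quantization group size
    "exclude_modules": ["lm_head"],    # modules that fall back to exclude_quantization
    "activation_scheme": "static",     # upper-cased and mapped to ActivationScheme.STATIC
    "exclude_quantization": {          # override applied to the excluded modules
        "quant_algo": None,
        "kv_cache_quant_algo": None,
        "activation_scheme": "dynamic",
    },
}
```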
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
Actionable comments posted: 5
🔭 Outside diff range comments (2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
1361-1368: Use fully-qualified module name for exclusion check; current code can yield false negatives. is_module_excluded_from_quantization expects the full module path for pattern matching. Passing only names[-1] ("kv_b_proj") can miss exclusions like "model.layers.12.self_attn.kv_b_proj". Also, your new gating to skip dequant when exclude_quant_config is provided makes sense; keep it, but fix the name passed.
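As a side illustration (not the library's implementation) of why the full path matters, a fnmatch-style stand-in for pattern-based exclusion matching behaves like this:

```python
from fnmatch import fnmatch

# Hypothetical exclusion patterns as they might appear in exclude_modules.
exclude_patterns = ["model.layers.*.self_attn.kv_b_proj"]


def is_excluded(module_name: str) -> bool:
    """Stand-in for pattern matching against the fully-qualified module path."""
    return any(fnmatch(module_name, pattern) for pattern in exclude_patterns)


print(is_excluded("model.layers.12.self_attn.kv_b_proj"))  # True
print(is_excluded("kv_b_proj"))  # False: the short name misses the exclusion
```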
Apply this diff to pass the fully-qualified module name:
```diff
-        dequant_kv_b_proj = self.model_config.quant_config.is_module_excluded_from_quantization(
-            names[-1]) and self.model_config.quant_config.exclude_quant_config is None
+        dequant_kv_b_proj = (
+            self.model_config.quant_config.is_module_excluded_from_quantization(name)
+            and self.model_config.quant_config.exclude_quant_config is None
+        )
```

tensorrt_llm/_torch/models/modeling_utils.py (1)
476-488: Preserve base QuantConfig when overriding for excluded modules; add None-safety. The new per-excluded override only sets quant_algo and activation_scheme and loses other QuantConfig fields (e.g., group_size, clamp_val, has_zero_point). Build the override from the existing QuantConfig to preserve defaults and runtime semantics. Also, guard against a None quant_config.
Apply this diff:
```diff
-            quant_algo = None
-            activation_scheme = None
-            exclude_quant_config = quant_config.exclude_quant_config
-            if exclude_quant_config:
-                quant_algo = exclude_quant_config.get("quant_algo", None)
-                activation_scheme = exclude_quant_config.get("activation_scheme", None)
-            new_config = QuantConfig(
-                quant_algo=quant_algo, kv_cache_quant_algo=kv_cache_quant_algo, activation_scheme=activation_scheme)
+            quant_algo = None
+            activation_scheme = None
+            exclude_quant_cfg = getattr(quant_config, "exclude_quant_config", None)
+            if exclude_quant_cfg:
+                quant_algo = exclude_quant_cfg.get("quant_algo")
+                activation_scheme = exclude_quant_cfg.get("activation_scheme")
+            # Preserve all other QuantConfig fields while overriding specific attributes
+            base_config = quant_config or QuantConfig()
+            new_config = dataclass_replace(
+                base_config,
+                quant_algo=quant_algo,
+                activation_scheme=activation_scheme,
+                kv_cache_quant_algo=kv_cache_quant_algo,
+            )
```

And add this import at the top of the file (outside this hunk):
```python
from dataclasses import replace as dataclass_replace
```
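For context, dataclasses.replace copies a dataclass instance and overrides only the named fields, which is what keeps group_size, clamp_val, and the other settings intact in the diff above. A toy sketch with stand-in fields (not the real QuantConfig):

```python
from dataclasses import dataclass, replace as dataclass_replace
from typing import Optional


@dataclass
class ToyQuantConfig:
    quant_algo: Optional[str] = None
    kv_cache_quant_algo: Optional[str] = None
    activation_scheme: Optional[str] = None
    group_size: int = 128  # stand-in for a field that must survive the override


base = ToyQuantConfig(quant_algo="FP8", group_size=64)
new_config = dataclass_replace(base, quant_algo=None, activation_scheme="STATIC")
print(new_config.group_size)  # 64 -> fields not named in replace() are preserved
```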
🧹 Nitpick comments (3)
tensorrt_llm/quantization/mode.py (1)
463-466: Enum addition looks good; consider adding a brief docstring for clarity. ActivationScheme is correctly defined as a StrEnum with BaseEnumMeta and is ready for serialization and validation. A short docstring would improve readability.
Apply this diff to add a concise docstring:
```diff
 class ActivationScheme(StrEnum, metaclass=BaseEnumMeta):
-    STATIC = auto()
-    DYNAMIC = auto()
+    """Activation quantization scheme."""
+    STATIC = auto()
+    DYNAMIC = auto()
```

tensorrt_llm/models/modeling_utils.py (1)
143-145: Tighten docstring wording and wrap long lines (fixes E501 and improves clarity). The current docstrings have grammar issues and exceed 120 chars. Reword and wrap to meet style and static-analysis guidance.
Apply this diff:
```diff
-        exclude_quant_config (Dict, optional): The model of exclude_modules will use exclude_quant_config.
-        activation_scheme (tensorrt_llm.quantization.mode.ActivationScheme, optional): The input of activation quantize scheme.
+        exclude_quant_config (Dict, optional): Per-module quantization overrides applied to modules
+            matched by exclude_modules. Only the provided fields are overridden (e.g., quant_algo,
+            kv_cache_quant_algo, activation_scheme).
+        activation_scheme (tensorrt_llm.quantization.mode.ActivationScheme, optional): Activation
+            quantization scheme (e.g., STATIC or DYNAMIC).
```

tensorrt_llm/_torch/model_config.py (1)
267-279: Simplify nested dictionary initialization for exclude_quantization. The nested ternary operations make the code hard to read and maintain. Consider extracting this into a helper function for better clarity.
Extract the logic into a helper function:
```diff
+def _parse_exclude_quantization(json_exclude_quant_configs):
+    if not json_exclude_quant_configs:
+        return None
+
+    result = {}
+    if json_exclude_quant_configs.get('quant_algo'):
+        result['quant_algo'] = QuantAlgo(json_exclude_quant_configs['quant_algo'].upper())
+    else:
+        result['quant_algo'] = None
+
+    if json_exclude_quant_configs.get('kv_cache_quant_algo'):
+        result['kv_cache_quant_algo'] = QuantAlgo(json_exclude_quant_configs['kv_cache_quant_algo'].upper())
+    else:
+        result['kv_cache_quant_algo'] = None
+
+    if json_exclude_quant_configs.get('activation_scheme'):
+        result['activation_scheme'] = ActivationScheme(json_exclude_quant_configs['activation_scheme'].upper())
+    else:
+        result['activation_scheme'] = None
+
+    return result

 json_exclude_quant_configs = json_quant_configs.get('exclude_quantization', None)
-if json_exclude_quant_configs:
-    quant_config.exclude_quant_config = {
-        "quant_algo": QuantAlgo(
-            json_exclude_quant_configs.get('quant_algo', None).upper()
-        ) if json_exclude_quant_configs.get("quant_algo") else None,
-        "kv_cache_quant_algo": QuantAlgo(
-            json_exclude_quant_configs.get("kv_cache_quant_algo").upper()
-        ) if json_exclude_quant_configs.get("kv_cache_quant_algo") else None,
-        "activation_scheme": ActivationScheme(
-            json_exclude_quant_configs.get('activation_scheme', None).upper()
-        ) if json_exclude_quant_configs.get("activation_scheme") else None,
-    }
+quant_config.exclude_quant_config = _parse_exclude_quantization(json_exclude_quant_configs)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
tensorrt_llm/_torch/model_config.py (4 hunks)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
tensorrt_llm/_torch/models/modeling_utils.py (1 hunks)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (6 hunks)
tensorrt_llm/llmapi/llm_utils.py (2 hunks)
tensorrt_llm/models/modeling_utils.py (3 hunks)
tensorrt_llm/quantization/mode.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/quantization/mode.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/models/modeling_utils.py
tensorrt_llm/models/modeling_utils.py
tensorrt_llm/_torch/model_config.py
tensorrt_llm/llmapi/llm_utils.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/quantization/mode.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/models/modeling_utils.py
tensorrt_llm/models/modeling_utils.py
tensorrt_llm/_torch/model_config.py
tensorrt_llm/llmapi/llm_utils.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
🧬 Code Graph Analysis (7)
tensorrt_llm/quantization/mode.py (1)
  tensorrt_llm/_utils.py (1)
    BaseEnumMeta (773-780)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
  tensorrt_llm/llmapi/llm_args.py (2)
    quant_config (2145-2148)
    quant_config (2151-2152)
tensorrt_llm/_torch/models/modeling_utils.py (2)
  tensorrt_llm/models/modeling_utils.py (2)
    quant_algo (550-551)
    QuantConfig (128-271)
  tensorrt_llm/llmapi/llm_args.py (2)
    quant_config (2145-2148)
    quant_config (2151-2152)
tensorrt_llm/models/modeling_utils.py (1)
  tensorrt_llm/quantization/mode.py (2)
    QuantAlgo (23-46)
    ActivationScheme (463-465)
tensorrt_llm/_torch/model_config.py (2)
  tensorrt_llm/quantization/mode.py (2)
    QuantAlgo (23-46)
    ActivationScheme (463-465)
  tensorrt_llm/models/modeling_utils.py (2)
    QuantConfig (128-271)
    quant_algo (550-551)
tensorrt_llm/llmapi/llm_utils.py (3)
  tensorrt_llm/models/modeling_utils.py (3)
    PretrainedConfig (369-570)
    QuantConfig (128-271)
    quant_algo (550-551)
  tensorrt_llm/quantization/mode.py (2)
    QuantAlgo (23-46)
    ActivationScheme (463-465)
  tensorrt_llm/llmapi/llm_args.py (2)
    quant_config (2145-2148)
    quant_config (2151-2152)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (3)
  tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
    MoEWeightLoadingMode (13-15)
  tests/unittest/_torch/modules/test_fused_moe.py (1)
    load_weights (1660-1725)
  tensorrt_llm/_torch/utils.py (1)
    shape (103-104)
🪛 Ruff (0.12.2)
tensorrt_llm/models/modeling_utils.py
143-143: Line too long (127 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/llmapi/llm_utils.py (1)
438-474: Manual verification needed for kv_cache_quant_method & activation_scheme enum values. It looks like no quantization_config.kv_cache_quant_method or .activation_scheme entries were found in the repo's JSON files; this may simply mean that those fields live in user-supplied Hugging Face configs or are generated at runtime. Before merging the enum-validation refactor, please:
- Confirm which enum members actually appear in your external HF configs or examples (e.g. via your model hubs or deployment manifests).
- Ensure the hardcoded “Expected one of” lists (INT8, FP8, NVFP4 and STATIC, DYNAMIC) cover every real-world value your users might supply.
- Update the error messages or docstrings accordingly if there are additional valid enum values (a minimal validation sketch follows below).
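For reference, the kind of enum validation being discussed might look like the sketch below. This is illustrative only: the member names come from the "Expected one of" lists mentioned above, and the helper is hypothetical rather than the actual code in llm_utils.py.

```python
from enum import Enum


class KvCacheQuantMethod(Enum):
    """Illustrative stand-in for the allowed kv_cache_quant_method values."""
    INT8 = "int8"
    FP8 = "fp8"
    NVFP4 = "nvfp4"


def parse_enum_field(raw_value, enum_cls, field_name):
    """Hypothetical helper: map a config string to an enum member with a clear error."""
    if raw_value is None:
        return None
    try:
        return enum_cls(raw_value.lower())
    except ValueError:
        expected = ", ".join(member.value for member in enum_cls)
        raise ValueError(
            f"Unsupported {field_name} {raw_value!r}. Expected one of: {expected}"
        ) from None


# parse_enum_field("FP8", KvCacheQuantMethod, "kv_cache_quant_method") -> KvCacheQuantMethod.FP8
```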
tensorrt_llm/_torch/model_config.py (2)
18-18: LGTM! Import addition for ActivationScheme is appropriate. The addition of ActivationScheme to the import statement is necessary for the new functionality.
326-330: Good addition of W4A8_AWQ support with proper error handling. The new case for "w4a8_awq" correctly maps to QuantAlgo.W4A8_AWQ, and the addition of NotImplementedError for unsupported quant methods improves error handling.
tensorrt_llm/_torch/modules/fused_moe/quantization.py (4)
209-210: Good addition of weight_name parameter for flexible weight loading. The addition of the weight_name parameter with a default value of "weight" provides backward compatibility while enabling support for different weight types.
218-220: LGTM! Proper use of weight_name in expert weight keys. The implementation correctly uses the weight_name parameter to construct the expert weight keys, allowing for flexible weight type selection.
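To make the key construction concrete, a minimal sketch of how per-expert keys can be derived from weight_name (the exact prefix layout in the real loader may differ):

```python
from typing import List


def expert_weight_keys(expert_id: int, weight_name: str = "weight") -> List[str]:
    """Illustrative construction of the per-expert keys read from the weights dict."""
    return [f"{expert_id}.{proj}.{weight_name}" for proj in ("w1", "w3", "w2")]


print(expert_weight_keys(0))             # ['0.w1.weight', '0.w3.weight', '0.w2.weight']
print(expert_weight_keys(0, "qweight"))  # keys used by the W4A8 qweight override
```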
959-963: Good override maintaining backward compatibility. The WInt4AFP8FusedMoEMethod.load_weights override with weight_name: str = "qweight" properly handles quantized weights while maintaining the interface contract.
985-991: Per-expert FP8 scale handling is correct. All of the expected "{expert_id}.w1.weight_scale" and "{expert_id}.w3.weight_scale" keys are present in the tests, and the code's element-wise max → stack → reshape logic aligns with the shape of module.fc31_alpha. No missing keys or shape mismatches were found.
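For readers unfamiliar with the pattern, a minimal sketch of the element-wise max → stack → reshape flow is below. The expert count, tensor shapes, and the (num_experts, 1) target layout are assumptions for illustration; the real values come from the module and the checkpoint.

```python
import torch

num_experts = 4  # assumed local expert count, for illustration only

# Assume per-expert scalar scales loaded from the checkpoint (illustrative values).
w1_scales = [torch.tensor(0.5 + 0.1 * i) for i in range(num_experts)]
w3_scales = [torch.tensor(0.4 + 0.1 * i) for i in range(num_experts)]

# Element-wise max across the fused w3/w1 pair, per expert.
per_expert_max = [torch.max(w1, w3) for w1, w3 in zip(w1_scales, w3_scales)]

# Stack into one tensor and reshape to a (num_experts, 1) layout, assumed here
# to be compatible with something like module.fc31_alpha.
fc31_alpha_like = torch.stack(per_expert_max).reshape(num_experts, 1)
print(fc31_alpha_like.shape)  # torch.Size([4, 1])
```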
```python
    @staticmethod
    def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
        quant_config = QuantConfig()
        layer_quant_config = None

        with open(quant_config_file) as f:
            quant_config_dict = json.load(f)

        json_quant_configs = quant_config_dict['quantization']

        quant_config.quant_algo = QuantAlgo(
            json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
        # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
        if quant_config.quant_algo == "fp8_pb_wo":
            quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')

        quant_config.kv_cache_quant_algo = QuantAlgo(
            json_quant_configs.get("kv_cache_quant_algo").upper()
        ) if json_quant_configs.get("kv_cache_quant_algo") else None
        quant_config.group_size = json_quant_configs.get('group_size', None)
        quant_config.exclude_modules = json_quant_configs.get(
            'exclude_modules', None)
        quant_config.activation_scheme = ActivationScheme(
            json_quant_configs.get('activation_scheme', None).upper()
        ) if json_quant_configs.get("activation_scheme") else None

        json_exclude_quant_configs = json_quant_configs.get('exclude_quantization', None)
        if json_exclude_quant_configs:
            quant_config.exclude_quant_config = {
                "quant_algo": QuantAlgo(
                    json_exclude_quant_configs.get('quant_algo', None).upper()
                ) if json_exclude_quant_configs.get("quant_algo") else None,
                "kv_cache_quant_algo": QuantAlgo(
                    json_exclude_quant_configs.get("kv_cache_quant_algo").upper()
                ) if json_exclude_quant_configs.get("kv_cache_quant_algo") else None,
                "activation_scheme": ActivationScheme(
                    json_exclude_quant_configs.get('activation_scheme', None).upper()
                ) if json_exclude_quant_configs.get("activation_scheme") else None,
            }
        return quant_config, layer_quant_config
```
Add error handling for malformed JSON config files.
The new load_angelslim_quant_config method assumes the JSON file is well-formed and contains the expected structure. Add error handling for cases where required keys might be missing or the JSON is malformed.
Consider adding validation like this:
```diff
 def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
     quant_config = QuantConfig()
     layer_quant_config = None
-    with open(quant_config_file) as f:
-        quant_config_dict = json.load(f)
+    try:
+        with open(quant_config_file) as f:
+            quant_config_dict = json.load(f)
+    except (json.JSONDecodeError, IOError) as e:
+        raise ValueError(f"Failed to load angelslim config from {quant_config_file}: {e}")
-    json_quant_configs = quant_config_dict['quantization']
+    json_quant_configs = quant_config_dict.get('quantization', {})
+    if not json_quant_configs:
+        raise ValueError(f"Missing 'quantization' section in {quant_config_file}")
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    @staticmethod
    def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
        quant_config = QuantConfig()
        layer_quant_config = None
        try:
            with open(quant_config_file) as f:
                quant_config_dict = json.load(f)
        except (json.JSONDecodeError, IOError) as e:
            raise ValueError(f"Failed to load angelslim config from {quant_config_file}: {e}")
        json_quant_configs = quant_config_dict.get('quantization', {})
        if not json_quant_configs:
            raise ValueError(f"Missing 'quantization' section in {quant_config_file}")
        quant_config.quant_algo = QuantAlgo(
            json_quant_configs.get('quant_algo', None).upper()
        ) if json_quant_configs.get("quant_algo") else None
        # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
        if quant_config.quant_algo == "fp8_pb_wo":
            quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
        quant_config.kv_cache_quant_algo = QuantAlgo(
            json_quant_configs.get("kv_cache_quant_algo").upper()
        ) if json_quant_configs.get("kv_cache_quant_algo") else None
        quant_config.group_size = json_quant_configs.get('group_size', None)
        quant_config.exclude_modules = json_quant_configs.get(
            'exclude_modules', None)
        quant_config.activation_scheme = ActivationScheme(
            json_quant_configs.get('activation_scheme', None).upper()
        ) if json_quant_configs.get("activation_scheme") else None
        json_exclude_quant_configs = json_quant_configs.get('exclude_quantization', None)
        if json_exclude_quant_configs:
            quant_config.exclude_quant_config = {
                "quant_algo": QuantAlgo(
                    json_exclude_quant_configs.get('quant_algo', None).upper()
                ) if json_exclude_quant_configs.get("quant_algo") else None,
                "kv_cache_quant_algo": QuantAlgo(
                    json_exclude_quant_configs.get("kv_cache_quant_algo").upper()
                ) if json_exclude_quant_configs.get("kv_cache_quant_algo") else None,
                "activation_scheme": ActivationScheme(
                    json_exclude_quant_configs.get('activation_scheme', None).upper()
                ) if json_exclude_quant_configs.get("activation_scheme") else None,
            }
        return quant_config, layer_quant_config
```
```python
        quant_config.quant_algo = QuantAlgo(
            json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
        # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
        if quant_config.quant_algo == "fp8_pb_wo":
            quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
```
Logical issue: string comparison against enum value.
Line 254 compares quant_config.quant_algo (which is now a QuantAlgo enum) against the string "fp8_pb_wo". This will always fail because you're comparing an enum to a string.
Fix the comparison:
```diff
-        quant_config.quant_algo = QuantAlgo(
-            json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
-        # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
-        if quant_config.quant_algo == "fp8_pb_wo":
-            quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
+        algo_str = json_quant_configs.get('quant_algo')
+        if algo_str:
+            algo_str = algo_str.upper()
+            # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
+            if algo_str == "FP8_PB_WO":
+                quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
+            else:
+                quant_config.quant_algo = QuantAlgo(algo_str)
+        else:
+            quant_config.quant_algo = None
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        algo_str = json_quant_configs.get('quant_algo')
        if algo_str:
            algo_str = algo_str.upper()
            # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
            if algo_str == "FP8_PB_WO":
                quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
            else:
                quant_config.quant_algo = QuantAlgo(algo_str)
        else:
            quant_config.quant_algo = None
```
🤖 Prompt for AI Agents
tensorrt_llm/_torch/model_config.py around lines 251 to 255: the code currently
compares the QuantAlgo enum instance to the string "fp8_pb_wo", which always
fails; change the check to compare the enum's name or value to that string (for
example: if quant_config.quant_algo.name.lower() == "fp8_pb_wo") and then set
quant_config.quant_algo = QuantAlgo("FP8_BLOCK_SCALES") as before.
```python
        elif (quant_config_file := model_dir / 'angelslim_hf_quant_config.json').exists():
            quant_config, layer_quant_config = cls.load_angelslim_quant_config(
                quant_config_file, model_dir, moe_backend)
```
💡 Verification agent
🧩 Analysis chain
Verify the consistency of the naming convention for config files.
The code now supports both angelslim_hf_quant_config.json and hf_quant_config.json. Ensure this naming is consistent with the tool's documentation and expected usage.
🏁 Script executed:
```shell
#!/bin/bash
# Check for any existing references to these config file names in the codebase
echo "Searching for references to angelslim_hf_quant_config.json:"
rg "angelslim_hf_quant_config" --type py -A 2 -B 2
echo -e "\nSearching for references to hf_quant_config.json:"
rg "hf_quant_config\.json" --type py -A 2 -B 2
echo -e "\nChecking for any documentation about these config files:"
fd -e md -e rst | xargs rg -l "angelslim_hf_quant_config|hf_quant_config"
```
Length of output: 8264
Update documentation to include the new AngelsLim config filename
The code now recognizes both hf_quant_config.json and angelslim_hf_quant_config.json, but our docs and examples only mention the former. Please update all references so users know both are supported:
• docs/source/performance/perf-benchmarking.md
• examples/models/core/deepseek_v3/README.md
– Add a note explaining that if an angelslim_hf_quant_config.json is present in the model directory, it will be loaded via load_angelslim_quant_config()
– Ensure any code snippets or CLI examples show both filenames where applicable
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 438 to 440, the code now
accepts an additional AngelsLim config filename
(angelslim_hf_quant_config.json); update the documentation and example README to
mention both hf_quant_config.json and angelslim_hf_quant_config.json. Edit
docs/source/performance/perf-benchmarking.md and
examples/models/core/deepseek_v3/README.md to add a short note that if
angelslim_hf_quant_config.json exists in the model directory it will be loaded
via load_angelslim_quant_config(), and update any code snippets or CLI examples
to show both filenames where applicable (e.g., list both filenames in examples
and usage text).
```diff
+        all_w3_scales = torch.stack(all_w3_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
         all_w1_scales = [
-            load_weight_shard(weights[f"{expert_id}.w1.weight_scale_inv"],
+            load_weight_shard(weights[f"{expert_id}.w1.weight_scale.int4"],
                               module.tp_size,
                               module.tp_rank,
                               TensorParallelMode.COLUMN,
                               device=self.device)
             for expert_id in module.initial_local_expert_ids
         ]
+        all_w1_scales = torch.stack(all_w1_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
         all_w3_w1_scales = torch.cat(
-            [torch.stack(all_w3_scales),
-             torch.stack(all_w1_scales)], dim=-2)
+            [all_w3_scales,
+             all_w1_scales], dim=-2)
```
Potential division by zero when scales are zero.
The code divides by all_w3_w1_scales_fp8_max and all_w2_scales_fp8 without checking if these values are zero. This could lead to runtime errors or inf/nan values.
Add validation for zero scales:
```diff
 all_w3_scales = torch.stack(all_w3_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
+if torch.any(all_w3_w1_scales_fp8_max == 0):
+    raise ValueError("Found zero FP8 weight scales, which would cause division by zero")
 all_w1_scales = torch.stack(all_w1_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
 all_w2_scales = torch.stack(all_w2_scales) / all_w2_scales_fp8.unsqueeze(2)
+if torch.any(all_w2_scales_fp8 == 0):
+    raise ValueError("Found zero FP8 weight scales for w2, which would cause division by zero")
```
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/quantization.py around lines 1002 to
1014, the code divides tensors by all_w3_w1_scales_fp8_max and all_w2_scales_fp8
without guarding against zeros; add a defensive check to avoid divide-by-zero by
replacing zero (or near-zero) elements with a small safe epsilon (e.g., 1e-6) or
use torch.clamp_min / torch.where to ensure the denominators are >= epsilon
before performing the divisions, and optionally log or assert if any
replacements occurred to aid debugging.
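A minimal sketch of the epsilon-clamping alternative mentioned in the prompt above; the epsilon value, tensor shapes, and helper name are illustrative rather than taken from the PR.

```python
import torch

EPS = 1e-6  # assumed safe lower bound; the real code may pick a different value


def safe_divide(numerator: torch.Tensor, denominator: torch.Tensor) -> torch.Tensor:
    """Divide while guarding against zero (or near-zero) denominators."""
    clamped = torch.clamp_min(denominator, EPS)
    if not torch.equal(clamped, denominator):
        # Surface that some scales were adjusted, to aid debugging.
        print("warning: some FP8 scales were below EPS and were clamped")
    return numerator / clamped


# Illustrative shapes: 4 experts, with a per-expert fp8 max scale (note the zero entry).
scales = torch.rand(4, 8, 16)
fp8_max = torch.tensor([0.5, 0.0, 0.25, 1.0])
normalized = safe_divide(scales, fp8_max.view(4, 1, 1))
```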
```diff
         all_w1_scales = [
-            load_weight_shard(weights[f"{expert_id}.w1.weight_scale_inv"],
+            load_weight_shard(weights[f"{expert_id}.w1.weight_scale.int4"],
```
Inconsistent weight scale key naming: weight_scale.int4 vs weight_scale_inv.
Line 1004 uses f"{expert_id}.w1.weight_scale.int4" while line 995 uses f"{expert_id}.w3.weight_scale_inv". This inconsistency could indicate different weight formats or be a bug.
Please verify if this naming difference is intentional:
```shell
#!/bin/bash
# Check for both naming patterns in the codebase
echo "Searching for weight_scale.int4 pattern:"
rg "weight_scale\.int4" --type py -B 2 -A 2
echo -e "\nSearching for weight_scale_inv pattern:"
rg "weight_scale_inv" --type py -B 2 -A 2
```
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/quantization.py around line 1004 there
is an inconsistent key name: line 1004 uses f"{expert_id}.w1.weight_scale.int4"
while earlier (line ~995) uses f"{expert_id}.w3.weight_scale_inv"; confirm
whether the correct stored key is weight_scale_inv or weight_scale.int4 by
searching the repo for both patterns, then make the keys consistent (prefer
using the canonical weight_scale_inv if other shards use that naming), update
the load_weight_shard call to use the canonical key across all shards, and add a
brief inline comment explaining the chosen convention so future readers know
which format is expected.
Summary by CodeRabbit
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.