Support W4A8 method of AngleSlim tool #6857

Open

wants to merge 1 commit into base: main

Conversation

@bppan commented Aug 13, 2025

Summary by CodeRabbit

  • New Features
    • Added support for Angelslim quantization configs when loading pretrained models.
    • Expanded HF quantization config support: w4a8_awq, kv-cache quantization, activation schemes (static/dynamic), module exclusions, and per-module override configs.
    • Configurable weight name in MoE backends for broader compatibility (supports qweight/weight paths).
    • Enhanced FP8/FP4 per-expert scale handling and SM100-specific resmoothing for improved stability and performance.
    • Added logging of detected quantization settings.
  • Bug Fixes
    • Corrected dequantization handling for certain DeepSeek V3 weights to properly respect exclusion settings.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. This ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
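For example, a minimal pre-merge invocation that runs a single stage with fail-fast disabled (using the stage name from the examples above) would be:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast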

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without sufficient care and validation can break the top of tree.
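For example (the comment text is illustrative):

/bot skip --comment "Docs-only change, no build/test needed"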

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without sufficient care and validation can break the top of tree.

@bppan bppan requested review from a team as code owners August 13, 2025 08:18

coderabbitai bot commented Aug 13, 2025

📝 Walkthrough

Adds Angelslim quantization config support, extends HF quant config parsing (new enums/fields/mappings), introduces ActivationScheme and exclude_quant_config in QuantConfig, adjusts excluded-module quant overrides, tightens DeepseekV3 kv_b_proj dequant condition, and overhauls fused MoE weight loading to be weight-name agnostic with FP8/FP4 per-expert scale handling and SM100-specific paths.

Changes

  • Quant schema extensions (tensorrt_llm/models/modeling_utils.py, tensorrt_llm/quantization/mode.py): Added the ActivationScheme enum; extended QuantConfig with activation_scheme and exclude_quant_config fields plus typing/docs.
  • HF/Angelslim quant config loading (tensorrt_llm/_torch/model_config.py, tensorrt_llm/llmapi/llm_utils.py): Added an Angelslim config loader and dispatch; expanded HF quant parsing with a new quant_method mapping (w4a8_awq), kv_cache_quant_method, activation_scheme, ignored_modules merging, and ignored_quantization_config mapped to exclude_quant_config; added logging and stricter handling of unsupported methods.
  • Exclude-quant overrides application (tensorrt_llm/_torch/models/modeling_utils.py): When exclude_quant_config exists, build a per-excluded-module QuantConfig with the specified quant_algo/activation_scheme while preserving kv_cache_quant_algo; otherwise keep the prior behavior.
  • DeepseekV3 kv_b_proj gating (tensorrt_llm/_torch/models/modeling_deepseekv3.py): Dequantization now requires the module to be excluded and exclude_quant_config to be None.
  • Fused MoE weight loading overhaul (tensorrt_llm/_torch/modules/fused_moe/quantization.py): Introduced weight_name across loaders (defaulting to weight/qweight); reworked FP8 QDQ per-expert scale/alpha computation and interleaving; added SM100 resmoothing of weight_scale_inv; updated backend overrides to accept weight_name; kept backward compatibility via defaults (see the sketch below).
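To make the backward-compatibility point concrete, below is a minimal illustrative sketch of a weight-name-agnostic gather step; the helper gather_expert_weights and its signature are assumptions for illustration, not the PR's actual loader code:

from typing import Dict, Iterable, List, Tuple

import torch


def gather_expert_weights(
    weights: Dict[str, torch.Tensor],
    expert_ids: Iterable[int],
    weight_name: str = "weight",
) -> Tuple[List[torch.Tensor], List[torch.Tensor]]:
    """Collect per-expert gate/up (w3/w1) and down (w2) tensors under a configurable key suffix."""
    w3_w1, w2 = [], []
    for expert_id in expert_ids:
        # Plain checkpoints store "...weight"; W4A8-AWQ checkpoints store "...qweight".
        w1 = weights[f"{expert_id}.w1.{weight_name}"]
        w3 = weights[f"{expert_id}.w3.{weight_name}"]
        w3_w1.append(torch.cat([w3, w1], dim=0))
        w2.append(weights[f"{expert_id}.w2.{weight_name}"])
    return w3_w1, w2


# Default keeps existing callers working; AWQ-style checkpoints pass weight_name="qweight", e.g.:
# fused_w3_w1, down = gather_expert_weights(weights, module.initial_local_expert_ids, "qweight")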

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant ModelConfig
  participant FS as Filesystem
  participant Parser

  User->>ModelConfig: from_pretrained(model_dir)
  ModelConfig->>FS: check angelslim_hf_quant_config.json
  alt Angelslim config exists
    ModelConfig->>Parser: load_angelslim_quant_config(file)
    Parser-->>ModelConfig: QuantConfig, layer_quant_config=None
  else
    ModelConfig->>FS: check hf_quant_config.json
    ModelConfig->>Parser: load_hf_quant_config(file)
    Parser-->>ModelConfig: QuantConfig (incl. activation_scheme/exclude_quant_config)
  end
  ModelConfig-->>User: Model with QuantConfig

sequenceDiagram
  participant Caller
  participant FusedMoE
  participant Weights as Weights Dict
  participant Backend

  Caller->>FusedMoE: load_weights(module, weights, mode, weight_name)
  FusedMoE->>Weights: read expert.w{1,3,2}.{weight_name}
  alt FP8 QDQ path
    FusedMoE->>FusedMoE: compute per-expert scales/alphas (w1/w3/w2)
    FusedMoE->>FusedMoE: interleave/normalize scales
    opt SM100
      FusedMoE->>FusedMoE: resmooth weight_scale_inv into weight
    end
  end
  FusedMoE->>Backend: set tensors/scales
  FusedMoE-->>Caller: done

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Suggested reviewers

  • achartier
  • hlu1
  • Tracin
  • Superjomn
  • litaotju
  • yuxianq

@coderabbitai bot left a comment

Actionable comments posted: 5

🔭 Outside diff range comments (2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

1361-1368: Use fully-qualified module name for exclusion check; current code can yield false negatives

is_module_excluded_from_quantization expects the full module path for pattern matching. Passing only names[-1] ("kv_b_proj") can miss exclusions like "model.layers.12.self_attn.kv_b_proj". Also, your new gating to skip dequant when exclude_quant_config is provided makes sense; keep it, but fix the name passed.

Apply this diff to pass the fully-qualified module name:

-                    dequant_kv_b_proj = self.model_config.quant_config.is_module_excluded_from_quantization(
-                        names[-1]) and self.model_config.quant_config.exclude_quant_config is None
+                    dequant_kv_b_proj = (
+                        self.model_config.quant_config.is_module_excluded_from_quantization(name)
+                        and self.model_config.quant_config.exclude_quant_config is None
+                    )
tensorrt_llm/_torch/models/modeling_utils.py (1)

476-488: Preserve base QuantConfig when overriding for excluded modules; add None-safety

The new per-excluded override only sets quant_algo and activation_scheme and loses other QuantConfig fields (e.g., group_size, clamp_val, has_zero_point). Build the override from the existing QuantConfig to preserve defaults and runtime semantics. Also, guard against a None quant_config.

Apply this diff:

-        quant_algo = None
-        activation_scheme = None
-        exclude_quant_config = quant_config.exclude_quant_config
-        if exclude_quant_config:
-            quant_algo = exclude_quant_config.get("quant_algo", None)
-            activation_scheme = exclude_quant_config.get("activation_scheme", None)
-        new_config = QuantConfig(
-            quant_algo=quant_algo, kv_cache_quant_algo=kv_cache_quant_algo, activation_scheme=activation_scheme)
+        quant_algo = None
+        activation_scheme = None
+        exclude_quant_cfg = getattr(quant_config, "exclude_quant_config", None)
+        if exclude_quant_cfg:
+            quant_algo = exclude_quant_cfg.get("quant_algo")
+            activation_scheme = exclude_quant_cfg.get("activation_scheme")
+        # Preserve all other QuantConfig fields while overriding specific attributes
+        base_config = quant_config or QuantConfig()
+        new_config = dataclass_replace(
+            base_config,
+            quant_algo=quant_algo,
+            activation_scheme=activation_scheme,
+            kv_cache_quant_algo=kv_cache_quant_algo,
+        )

And add this import at the top of the file (outside this hunk):

from dataclasses import replace as dataclass_replace
🧹 Nitpick comments (3)
tensorrt_llm/quantization/mode.py (1)

463-466: Enum addition looks good; consider adding a brief docstring for clarity

ActivationScheme is correctly defined as a StrEnum with BaseEnumMeta and is ready for serialization and validation. A short docstring would improve readability.

Apply this diff to add a concise docstring:

 class ActivationScheme(StrEnum, metaclass=BaseEnumMeta):
-    STATIC = auto()
-    DYNAMIC = auto()
+    """Activation quantization scheme."""
+    STATIC = auto()
+    DYNAMIC = auto()
tensorrt_llm/models/modeling_utils.py (1)

143-145: Tighten docstring wording and wrap long lines (fixes E501 and improves clarity)

The current docstrings have grammar issues and exceed 120 chars. Reword and wrap to meet style and static-analysis guidance.

Apply this diff:

-        exclude_quant_config  (Dict, optional): The model of exclude_modules will use exclude_quant_config.
-        activation_scheme (tensorrt_llm.quantization.mode.ActivationScheme, optional): The input of activation quantize scheme.
+        exclude_quant_config (Dict, optional): Per‑module quantization overrides applied to modules
+            matched by exclude_modules. Only the provided fields are overridden (e.g., quant_algo,
+            kv_cache_quant_algo, activation_scheme).
+        activation_scheme (tensorrt_llm.quantization.mode.ActivationScheme, optional): Activation
+            quantization scheme (e.g., STATIC or DYNAMIC).
tensorrt_llm/_torch/model_config.py (1)

267-279: Simplify nested dictionary initialization for exclude_quantization.

The nested ternary operations make the code hard to read and maintain. Consider extracting this into a helper function for better clarity.

Extract the logic into a helper function:

+def _parse_exclude_quantization(json_exclude_quant_configs):
+    if not json_exclude_quant_configs:
+        return None
+    
+    result = {}
+    if json_exclude_quant_configs.get('quant_algo'):
+        result['quant_algo'] = QuantAlgo(json_exclude_quant_configs['quant_algo'].upper())
+    else:
+        result['quant_algo'] = None
+    
+    if json_exclude_quant_configs.get('kv_cache_quant_algo'):
+        result['kv_cache_quant_algo'] = QuantAlgo(json_exclude_quant_configs['kv_cache_quant_algo'].upper())
+    else:
+        result['kv_cache_quant_algo'] = None
+    
+    if json_exclude_quant_configs.get('activation_scheme'):
+        result['activation_scheme'] = ActivationScheme(json_exclude_quant_configs['activation_scheme'].upper())
+    else:
+        result['activation_scheme'] = None
+    
+    return result

 json_exclude_quant_configs = json_quant_configs.get('exclude_quantization', None)
-if json_exclude_quant_configs:
-    quant_config.exclude_quant_config = {
-        "quant_algo": QuantAlgo(
-            json_exclude_quant_configs.get('quant_algo', None).upper()
-        ) if json_exclude_quant_configs.get("quant_algo") else None,
-        "kv_cache_quant_algo": QuantAlgo(
-            json_exclude_quant_configs.get("kv_cache_quant_algo").upper()
-        ) if json_exclude_quant_configs.get("kv_cache_quant_algo") else None,
-        "activation_scheme": ActivationScheme(
-            json_exclude_quant_configs.get('activation_scheme', None).upper()
-        ) if json_exclude_quant_configs.get("activation_scheme") else None,
-    }
+quant_config.exclude_quant_config = _parse_exclude_quantization(json_exclude_quant_configs)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50e5e72 and 15fbe20.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/model_config.py (4 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_utils.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py (6 hunks)
  • tensorrt_llm/llmapi/llm_utils.py (2 hunks)
  • tensorrt_llm/models/modeling_utils.py (3 hunks)
  • tensorrt_llm/quantization/mode.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/quantization/mode.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tensorrt_llm/models/modeling_utils.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/quantization/mode.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tensorrt_llm/models/modeling_utils.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
🧬 Code Graph Analysis (7)
tensorrt_llm/quantization/mode.py (1)
tensorrt_llm/_utils.py (1)
  • BaseEnumMeta (773-780)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2145-2148)
  • quant_config (2151-2152)
tensorrt_llm/_torch/models/modeling_utils.py (2)
tensorrt_llm/models/modeling_utils.py (2)
  • quant_algo (550-551)
  • QuantConfig (128-271)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2145-2148)
  • quant_config (2151-2152)
tensorrt_llm/models/modeling_utils.py (1)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-46)
  • ActivationScheme (463-465)
tensorrt_llm/_torch/model_config.py (2)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-46)
  • ActivationScheme (463-465)
tensorrt_llm/models/modeling_utils.py (2)
  • QuantConfig (128-271)
  • quant_algo (550-551)
tensorrt_llm/llmapi/llm_utils.py (3)
tensorrt_llm/models/modeling_utils.py (3)
  • PretrainedConfig (369-570)
  • QuantConfig (128-271)
  • quant_algo (550-551)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-46)
  • ActivationScheme (463-465)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2145-2148)
  • quant_config (2151-2152)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (3)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • MoEWeightLoadingMode (13-15)
tests/unittest/_torch/modules/test_fused_moe.py (1)
  • load_weights (1660-1725)
tensorrt_llm/_torch/utils.py (1)
  • shape (103-104)
🪛 Ruff (0.12.2)
tensorrt_llm/models/modeling_utils.py

143-143: Line too long (127 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/llmapi/llm_utils.py (1)

438-474: Manual verification needed for kv_cache_quant_method & activation_scheme enum values

It looks like no quantization_config.kv_cache_quant_method or .activation_scheme entries were found in the repo’s JSON files—this may simply mean that those fields live in user‐supplied Hugging Face configs or are generated at runtime. Before merging the enum-validation refactor, please:

  • Confirm which enum members actually appear in your external HF configs or examples (e.g. via your model hubs or deployment manifests).
  • Ensure the hardcoded “Expected one of” lists (INT8, FP8, NVFP4 and STATIC, DYNAMIC) cover every real-world value your users might supply.
  • Update the error messages or docstrings accordingly if there are additional valid enum values.
tensorrt_llm/_torch/model_config.py (2)

18-18: LGTM! Import addition for ActivationScheme is appropriate.

The addition of ActivationScheme to the import statement is necessary for the new functionality.


326-330: Good addition of W4A8_AWQ support with proper error handling.

The new case for "w4a8_awq" correctly maps to QuantAlgo.W4A8_AWQ, and the addition of NotImplementedError for unsupported quant methods improves error handling.

tensorrt_llm/_torch/modules/fused_moe/quantization.py (4)

209-210: Good addition of weight_name parameter for flexible weight loading.

The addition of the weight_name parameter with a default value of "weight" provides backward compatibility while enabling support for different weight types.


218-220: LGTM! Proper use of weight_name in expert weight keys.

The implementation correctly uses the weight_name parameter to construct the expert weight keys, allowing for flexible weight type selection.


959-963: Good override maintaining backward compatibility.

The WInt4AFP8FusedMoEMethod.load_weights override with weight_name: str = "qweight" properly handles quantized weights while maintaining the interface contract.


985-991: Per-expert FP8 scale handling is correct
All of the expected "{expert_id}.w1.weight_scale" and "{expert_id}.w3.weight_scale" keys are present in the tests, and the code’s element-wise max → stack → reshape logic aligns with the shape of module.fc31_alpha. No missing keys or shape mismatches were found.
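The verified pattern (element-wise max of each expert's w1/w3 scales, then stacking across experts) can be sketched as follows; the function and variable names are illustrative placeholders, not the module's actual attributes:

import torch


def fuse_gate_up_scales(w1_scales, w3_scales):
    """Per-expert shared scale for the fused gate/up GEMM: element-wise max, then stack."""
    per_expert = [torch.maximum(s1, s3) for s1, s3 in zip(w1_scales, w3_scales)]
    return torch.stack(per_expert)  # shape: (num_local_experts, *scale_shape)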

Comment on lines +241 to +281
@staticmethod
def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
    quant_config = QuantConfig()
    layer_quant_config = None

    with open(quant_config_file) as f:
        quant_config_dict = json.load(f)

    json_quant_configs = quant_config_dict['quantization']

    quant_config.quant_algo = QuantAlgo(
        json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
    # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
    if quant_config.quant_algo == "fp8_pb_wo":
        quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')

    quant_config.kv_cache_quant_algo = QuantAlgo(
        json_quant_configs.get("kv_cache_quant_algo").upper()
    ) if json_quant_configs.get("kv_cache_quant_algo") else None
    quant_config.group_size = json_quant_configs.get('group_size', None)
    quant_config.exclude_modules = json_quant_configs.get(
        'exclude_modules', None)
    quant_config.activation_scheme = ActivationScheme(
        json_quant_configs.get('activation_scheme', None).upper()
    ) if json_quant_configs.get("activation_scheme") else None

    json_exclude_quant_configs = json_quant_configs.get('exclude_quantization', None)
    if json_exclude_quant_configs:
        quant_config.exclude_quant_config = {
            "quant_algo": QuantAlgo(
                json_exclude_quant_configs.get('quant_algo', None).upper()
            ) if json_exclude_quant_configs.get("quant_algo") else None,
            "kv_cache_quant_algo": QuantAlgo(
                json_exclude_quant_configs.get("kv_cache_quant_algo").upper()
            ) if json_exclude_quant_configs.get("kv_cache_quant_algo") else None,
            "activation_scheme": ActivationScheme(
                json_exclude_quant_configs.get('activation_scheme', None).upper()
            ) if json_exclude_quant_configs.get("activation_scheme") else None,
        }
    return quant_config, layer_quant_config


⚠️ Potential issue

Add error handling for malformed JSON config files.

The new load_angelslim_quant_config method assumes the JSON file is well-formed and contains the expected structure. Add error handling for cases where required keys might be missing or the JSON is malformed.

Consider adding validation like this:

 def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
     quant_config = QuantConfig()
     layer_quant_config = None
 
-    with open(quant_config_file) as f:
-        quant_config_dict = json.load(f)
+    try:
+        with open(quant_config_file) as f:
+            quant_config_dict = json.load(f)
+    except (json.JSONDecodeError, IOError) as e:
+        raise ValueError(f"Failed to load angelslim config from {quant_config_file}: {e}")
 
-    json_quant_configs = quant_config_dict['quantization']
+    json_quant_configs = quant_config_dict.get('quantization', {})
+    if not json_quant_configs:
+        raise ValueError(f"Missing 'quantization' section in {quant_config_file}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@staticmethod
def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
    quant_config = QuantConfig()
    layer_quant_config = None

    try:
        with open(quant_config_file) as f:
            quant_config_dict = json.load(f)
    except (json.JSONDecodeError, IOError) as e:
        raise ValueError(f"Failed to load angelslim config from {quant_config_file}: {e}")

    json_quant_configs = quant_config_dict.get('quantization', {})
    if not json_quant_configs:
        raise ValueError(f"Missing 'quantization' section in {quant_config_file}")

    quant_config.quant_algo = QuantAlgo(
        json_quant_configs.get('quant_algo', None).upper()
    ) if json_quant_configs.get("quant_algo") else None
    # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
    if quant_config.quant_algo == "fp8_pb_wo":
        quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')

    quant_config.kv_cache_quant_algo = QuantAlgo(
        json_quant_configs.get("kv_cache_quant_algo").upper()
    ) if json_quant_configs.get("kv_cache_quant_algo") else None
    quant_config.group_size = json_quant_configs.get('group_size', None)
    quant_config.exclude_modules = json_quant_configs.get(
        'exclude_modules', None)
    quant_config.activation_scheme = ActivationScheme(
        json_quant_configs.get('activation_scheme', None).upper()
    ) if json_quant_configs.get("activation_scheme") else None

    json_exclude_quant_configs = json_quant_configs.get('exclude_quantization', None)
    if json_exclude_quant_configs:
        quant_config.exclude_quant_config = {
            "quant_algo": QuantAlgo(
                json_exclude_quant_configs.get('quant_algo', None).upper()
            ) if json_exclude_quant_configs.get("quant_algo") else None,
            "kv_cache_quant_algo": QuantAlgo(
                json_exclude_quant_configs.get("kv_cache_quant_algo").upper()
            ) if json_exclude_quant_configs.get("kv_cache_quant_algo") else None,
            "activation_scheme": ActivationScheme(
                json_exclude_quant_configs.get('activation_scheme', None).upper()
            ) if json_exclude_quant_configs.get("activation_scheme") else None,
        }
    return quant_config, layer_quant_config

Comment on lines +251 to +255
    quant_config.quant_algo = QuantAlgo(
        json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
    # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
    if quant_config.quant_algo == "fp8_pb_wo":
        quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')

⚠️ Potential issue

Logical issue: string comparison against enum value.

Line 254 compares quant_config.quant_algo (which is now a QuantAlgo enum) against the string "fp8_pb_wo". This will always fail because you're comparing an enum to a string.

Fix the comparison:

-    quant_config.quant_algo = QuantAlgo(
-        json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
-    # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
-    if quant_config.quant_algo == "fp8_pb_wo":
-        quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
+    algo_str = json_quant_configs.get('quant_algo')
+    if algo_str:
+        algo_str = algo_str.upper()
+        # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
+        if algo_str == "FP8_PB_WO":
+            quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
+        else:
+            quant_config.quant_algo = QuantAlgo(algo_str)
+    else:
+        quant_config.quant_algo = None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    algo_str = json_quant_configs.get('quant_algo')
    if algo_str:
        algo_str = algo_str.upper()
        # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
        if algo_str == "FP8_PB_WO":
            quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
        else:
            quant_config.quant_algo = QuantAlgo(algo_str)
    else:
        quant_config.quant_algo = None
🤖 Prompt for AI Agents
tensorrt_llm/_torch/model_config.py around lines 251 to 255: the code currently
compares the QuantAlgo enum instance to the string "fp8_pb_wo", which always
fails; change the check to compare the enum's name or value to that string (for
example: if quant_config.quant_algo.name.lower() == "fp8_pb_wo") and then set
quant_config.quant_algo = QuantAlgo("FP8_BLOCK_SCALES") as before.

Comment on lines +438 to +440
        elif (quant_config_file := model_dir / 'angelslim_hf_quant_config.json').exists():
            quant_config, layer_quant_config = cls.load_angelslim_quant_config(
                quant_config_file, model_dir, moe_backend)

💡 Verification agent

🧩 Analysis chain

Verify the consistency of the naming convention for config files.

The code now supports both angelslim_hf_quant_config.json and hf_quant_config.json. Ensure this naming is consistent with the tool's documentation and expected usage.


🏁 Script executed:

#!/bin/bash
# Check for any existing references to these config file names in the codebase
echo "Searching for references to angelslim_hf_quant_config.json:"
rg "angelslim_hf_quant_config" --type py -A 2 -B 2

echo -e "\nSearching for references to hf_quant_config.json:"
rg "hf_quant_config\.json" --type py -A 2 -B 2

echo -e "\nChecking for any documentation about these config files:"
fd -e md -e rst | xargs rg -l "angelslim_hf_quant_config|hf_quant_config"

Length of output: 8264


Update documentation to include the new AngelSlim config filename

The code now recognizes both hf_quant_config.json and angelslim_hf_quant_config.json, but our docs and examples only mention the former. Please update all references so users know both are supported:

• docs/source/performance/perf-benchmarking.md
• examples/models/core/deepseek_v3/README.md

– Add a note explaining that if an angelslim_hf_quant_config.json is present in the model directory, it will be loaded via load_angelslim_quant_config()
– Ensure any code snippets or CLI examples show both filenames where applicable

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 438 to 440, the code now
accepts an additional AngelsLim config filename
(angelslim_hf_quant_config.json); update the documentation and example README to
mention both hf_quant_config.json and angelslim_hf_quant_config.json. Edit
docs/source/performance/perf-benchmarking.md and
examples/models/core/deepseek_v3/README.md to add a short note that if
angelslim_hf_quant_config.json exists in the model directory it will be loaded
via load_angelslim_quant_config(), and update any code snippets or CLI examples
to show both filenames where applicable (e.g., list both filenames in examples
and usage text).

Comment on lines +1002 to +1014
+        all_w3_scales = torch.stack(all_w3_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
         all_w1_scales = [
-            load_weight_shard(weights[f"{expert_id}.w1.weight_scale_inv"],
+            load_weight_shard(weights[f"{expert_id}.w1.weight_scale.int4"],
                               module.tp_size,
                               module.tp_rank,
                               TensorParallelMode.COLUMN,
                               device=self.device)
             for expert_id in module.initial_local_expert_ids
         ]
+        all_w1_scales = torch.stack(all_w1_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
         all_w3_w1_scales = torch.cat(
-            [torch.stack(all_w3_scales),
-             torch.stack(all_w1_scales)], dim=-2)
+            [all_w3_scales,
+             all_w1_scales], dim=-2)

⚠️ Potential issue

Potential division by zero when scales are zero.

The code divides by all_w3_w1_scales_fp8_max and all_w2_scales_fp8 without checking if these values are zero. This could lead to runtime errors or inf/nan values.

Add validation for zero scales:

 all_w3_scales = torch.stack(all_w3_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
+if torch.any(all_w3_w1_scales_fp8_max == 0):
+    raise ValueError("Found zero FP8 weight scales, which would cause division by zero")
 
 all_w1_scales = torch.stack(all_w1_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
 
 all_w2_scales = torch.stack(all_w2_scales) / all_w2_scales_fp8.unsqueeze(2)
+if torch.any(all_w2_scales_fp8 == 0):
+    raise ValueError("Found zero FP8 weight scales for w2, which would cause division by zero")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/quantization.py around lines 1002 to
1014, the code divides tensors by all_w3_w1_scales_fp8_max and all_w2_scales_fp8
without guarding against zeros; add a defensive check to avoid divide-by-zero by
replacing zero (or near-zero) elements with a small safe epsilon (e.g., 1e-6) or
use torch.clamp_min / torch.where to ensure the denominators are >= epsilon
before performing the divisions, and optionally log or assert if any
replacements occurred to aid debugging.
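A minimal sketch of the epsilon guard suggested above (the helper name and EPS value are illustrative, not code from the PR):

import torch

EPS = 1e-6  # assumed safe lower bound for FP8 weight scales


def safe_div(numerator: torch.Tensor, denominator: torch.Tensor) -> torch.Tensor:
    """Divide while clamping near-zero denominators to avoid inf/nan scale values."""
    return numerator / torch.clamp_min(denominator, EPS)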

         all_w1_scales = [
-            load_weight_shard(weights[f"{expert_id}.w1.weight_scale_inv"],
+            load_weight_shard(weights[f"{expert_id}.w1.weight_scale.int4"],

⚠️ Potential issue

Inconsistent weight scale key naming: weight_scale.int4 vs weight_scale_inv.

Line 1004 uses f"{expert_id}.w1.weight_scale.int4" while line 995 uses f"{expert_id}.w3.weight_scale_inv". This inconsistency could indicate different weight formats or be a bug.

Please verify if this naming difference is intentional:

#!/bin/bash
# Check for both naming patterns in the codebase
echo "Searching for weight_scale.int4 pattern:"
rg "weight_scale\.int4" --type py -B 2 -A 2

echo -e "\nSearching for weight_scale_inv pattern:"
rg "weight_scale_inv" --type py -B 2 -A 2
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/quantization.py around line 1004 there
is an inconsistent key name: line 1004 uses f"{expert_id}.w1.weight_scale.int4"
while earlier (line ~995) uses f"{expert_id}.w3.weight_scale_inv"; confirm
whether the correct stored key is weight_scale_inv or weight_scale.int4 by
searching the repo for both patterns, then make the keys consistent (prefer
using the canonical weight_scale_inv if other shards use that naming), update
the load_weight_shard call to use the canonical key across all shards, and add a
brief inline comment explaining the chosen convention so future readers know
which format is expected.

@svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) Aug 13, 2025