
Conversation


@yiqingy0 (Collaborator) commented Aug 18, 2025

Summary by CodeRabbit

  • New Features

    • CI now runs DGX B300 (4‑GPU) PyTorch tests as part of the pre‑merge matrix.
  • Tests

    • Added an integration test group targeting 4‑GPU DGX B300 Ubuntu systems for multi‑GPU PyTorch deepseek scenarios.
  • Bug Fixes

    • Broadened GPU-type handling to include DGX‑B300 for dynamic driver flashing behavior.
  • Chores

    • CI shared-library reference updated to a forked source.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
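For reference, a few concrete combinations of the flags above, posted verbatim as PR comments (the stage and GPU names are the examples already used in this help text and later in this PR):

```
/bot run --stage-list "DGX_B300-4_GPUs-PyTorch-1"
/bot run --gpu-type "A30, H100_PCIe" --disable-fail-fast
/bot run --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"
```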

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action also kills all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a stale pipeline without careful validation can break the top of tree.


coderabbitai bot commented Aug 18, 2025

📝 Walkthrough


Adds DGX_B300 4-GPU test entries to the Jenkins L0 test matrix and updates pod GPU-type handling; introduces a new test-db YAML defining the l0_dgx_b300 pre-merge PyTorch multi-GPU test. Also changes a library load path to a forked bloom shared-lib reference.

Changes

Cohort / File(s) Change Summary
Jenkins CI pipeline
jenkins/L0_Test.groovy
- Updated library load path to reference a forked bloom-jenkins-shared-lib user path while retaining trtllm-jenkins-shared-lib@main.
- Extended GPU-type check to include dgx-b300 in createKubernetesPodConfig logic.
- Added new x86 test matrix entry DGX_B300-4_GPUs-PyTorch-1 in launchTestJobs.
Test DB addition
tests/integration/test_lists/test-db/l0_dgx_b300.yml
- Added v0.0.1 YAML defining l0_dgx_b300 group requiring exactly 4 GPUs (system_gpu_count gte/lte 4), GPU model wildcard *gb110*, Ubuntu distro, stage: pre_merge, backend: pytorch, and a PyTorch multi-GPU test (unittest/_torch/multi_gpu_modeling -k "deepseek").
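The fragments of this file quoted in the review comments below can be reassembled into roughly the following shape (the exact indentation and key nesting are assumptions based on sibling test-db files such as l0_h100.yml):

```yaml
version: 0.0.1
l0_dgx_b300:
- condition:
    ranges:
      system_gpu_count:
        gte: 4
        lte: 4
    wildcards:
      gpu:
      - '*b300*'
      linux_distribution_name: ubuntu*
    terms:
      stage: pre_merge
      backend: pytorch
  tests:
  - unittest/_torch/multi_gpu_modeling -k "deepseek"
```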

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Jenkins as Jenkins L0 pipeline
  participant Matrix as launchTestJobs
  participant Cfg as createKubernetesPodConfig
  participant Kube as Kubernetes
  participant TestDB as Test DB (l0_dgx_b300.yml)

  Jenkins->>Matrix: build test matrix (includes DGX_B300-4_GPUs-PyTorch-1)
  Matrix->>Cfg: request pod config (arch, gpuType, gpuCount, image...)
  Cfg-->>Matrix: return pod spec (gpu-type check includes dgx-b300)
  Matrix->>Kube: launch job(s) with pod spec
  Kube->>TestDB: query tests for matching job (loads l0_dgx_b300.yml)
  Kube->>Kube: run tests (PyTorch multi-GPU)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested reviewers

  • chzblych
  • tburt-nv
  • ruodil


@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/integration/test_lists/test-db/l0_b300.yml (1)

5-7: Double-check GPU count gate vs selected tests

The condition gates to exactly one GPU (gte: 1, lte: 1), while the selected test comes from multi_gpu_modeling. If that test expects more than one GPU on B300 nodes, this stage will systematically skip/xfail or under-exercise the path.

If needed, relax the GPU gate (e.g., gte: 2 for true multi-GPU) or pick a single-GPU-friendly sanity test for the initial B300 bring-up. Want a suggested minimal single-GPU PyTorch test list?

Also applies to: 15-17

jenkins/L0_Test.groovy (1)

657-679: Consider centralizing mapping and verify driver-flashing constraints for B300

Two follow-ups to keep things robust:

  • Prefer normalizing type to lowercase before checks to avoid case drift across call sites.
  • Verify whether B300 belongs in the “no dynamic driver flashing” bucket (the branch that omits nvidia.com/driver_version). If it does, include it alongside the existing types to avoid scheduling mismatches.

Proposed tweaks:

-        def gpuType = KubernetesManager.selectGPU(type)
-        if (type.contains("b300")) {
+        def normalizedType = type?.toLowerCase()
+        def gpuType = KubernetesManager.selectGPU(normalizedType)
+        if (normalizedType.contains("b300")) {
             gpuType = "NVIDIA_HGX_B300"
         }
@@
-        if (type.contains("dgx-h100") || type.contains("dgx-h200") || type in ["b100-ts2", "gh200", "rtx-5080", "rtx-5090"]) {
+        if (normalizedType.contains("dgx-h100") || normalizedType.contains("dgx-h200") ||
+            normalizedType in ["b100-ts2", "gh200", "rtx-5080", "rtx-5090"] ||
+            normalizedType.contains("b300") /* if B300 should avoid driver flashing */) {

If the driver-flashing rule doesn’t apply to B300, skip the last addition and keep only the normalization.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 55f4f2d and d3c1e22.

📒 Files selected for processing (2)
  • jenkins/L0_Test.groovy (2 hunks)
  • tests/integration/test_lists/test-db/l0_b300.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/integration/test_lists/test-db/l0_b300.yml (1)

8-11: linux_distribution_name placement is correct

Verified across all tests/integration/test_lists/test-db entries (e.g. l0_h100.yml, l0_perf.yml, etc.) that linux_distribution_name is always nested under wildcards. No change required.

jenkins/L0_Test.groovy (1)

657-659: LGTM: Explicit B300 gpuType override

Hard-setting B300 to "NVIDIA_HGX_B300" avoids mis-selection by the generic mapper. This is a pragmatic guardrail for early enablement.

Comment on lines 1846 to 1880
"B300_PCIe-PyTorch-1": ["b300", "l0_b300", 1, 1],
]

💡 Verification agent

🧩 Analysis chain

Stage wiring looks correct, but it depends on the test-db context fix

The new entry uses ["b300", "l0_b300", 1, 1], which aligns with other single-GPU entries and the naming convention used by filters (GPU token: B300_PCIe). However, the test-db file currently defines the context as l0_b200. Without fixing that, this stage will render an empty list.

After applying the YAML fix, run this to ensure the context is discoverable:


🏁 Script executed:

#!/bin/bash
# Verify the l0_b300 context exists and is referenced
fd -a 'l0_b300.yml'
rg -n -C2 '^\s*l0_b300:' tests/integration/test_lists/test-db/l0_b300.yml

Length of output: 163


Define matching l0_b300 context in test-db

The new stage entry uses GPU context l0_b300, but tests/integration/test_lists/test-db/l0_b300.yml does not define that key (it still defines l0_b200:), so the stage will render an empty list.

– In tests/integration/test_lists/test-db/l0_b300.yml, update the root key to l0_b300: (e.g., change l0_b200: to l0_b300:).
– After updating, verify with:

fd -a l0_b300.yml
rg -n -C2 '^\s*l0_b300:' tests/integration/test_lists/test-db/l0_b300.yml
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 1846-1847, the new stage references GPU
context "l0_b300" but the test-db file
tests/integration/test_lists/test-db/l0_b300.yml still defines the root key as
"l0_b200", so the stage will render an empty list; update the root key in
tests/integration/test_lists/test-db/l0_b300.yml from "l0_b200:" to "l0_b300:"
and save, then verify the change exists (for example run fd -a l0_b300.yml and
rg -n -C2 '^\s*l0_b300:' tests/integration/test_lists/test-db/l0_b300.yml).

@@ -0,0 +1,17 @@
version: 0.0.1
l0_b200:

⚠️ Potential issue

Blocking: Context key mismatch will yield an empty test list

The top-level key is l0_b200 but the file name and Jenkins job use l0_b300. The test-db renderer looks up the context by name; this mismatch will cause no tests to be selected for B300.

Please rename the key to match the context.

Apply this diff:

-version: 0.0.1
-l0_b200:
+version: 0.0.1
+l0_b300:
🤖 Prompt for AI Agents
In tests/integration/test_lists/test-db/l0_b300.yml around line 2, the top-level
context key is mistakenly set to "l0_b200" which does not match the filename and
Jenkins job "l0_b300"; rename the top-level key from "l0_b200" to "l0_b300" so
the test-db renderer can find the context and select the correct tests (ensure
the key exactly matches the file/job name).

@yiqingy0
Collaborator Author

/bot run --stage-list "B300_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #15739 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15739 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11828 (Partly Tested) completed with status: 'FAILURE'

@yiqingy0
Collaborator Author

/bot run --stage-list "B300_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #16008 [ run ] triggered by Bot

@yiqingy0
Collaborator Author

/bot run --stage-list "B300_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #16013 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16008 [ run ] completed with state ABORTED

@coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
tests/integration/test_lists/test-db/l0_dgx_b300.yml (1)

2-2: Context key mismatch: l0_b200 vs. expected l0_dgx_b300 (breaks test DB resolution).

The Jenkins stage maps to context "l0_dgx_b300", but this YAML defines "l0_b200". This will render an empty test list. Fix the root key to match the stage mapping.

Apply this diff:

-version: 0.0.1
-l0_b200:
+version: 0.0.1
+l0_dgx_b300:
🧹 Nitpick comments (4)
tests/integration/test_lists/test-db/l0_dgx_b300.yml (2)

9-11: Make GPU wildcard consistent with chip naming (case).

If your sysinfo reports GPU names with uppercase prefixes (e.g., "NVIDIA_HGX_B300"), a lowercase wildcard may not match depending on DB processing. Prefer an uppercase wildcard to be safe, consistent with other contexts.

Apply this diff:

-      gpu:
-      - '*b300*'
+      gpu:
+      - '*B300*'

16-17: Quote the test line to avoid YAML parsing quirks.

YAML will treat the entire scalar after '-' as a single value, but quoting avoids surprises with the inner -k and quotes.

Apply this diff:

-  - unittest/_torch/multi_gpu_modeling -k "deepseek"
+  - "unittest/_torch/multi_gpu_modeling -k deepseek"
jenkins/L0_Test.groovy (2)

658-661: Hardcoded B300 GPU type override is fine; also scale resources for B300 multi-GPU.

The override to "NVIDIA_HGX_B300" ensures correct node selection. However, multi-GPU resource scaling currently only applies to DGX-H100/H200, which may under-provision memory/CPU for 4x B300 jobs.

Apply this diff to include B300 in the scaling clause:

-        // Multi-GPU only supports DGX-H100 and DGX-H200 due to the hardware stability.
-        if ((type.contains("dgx-h100") || type.contains("dgx-h200")) && hasMultipleGPUs)
+        // Multi-GPU scaling: include DGX-H100/H200 and B300.
+        if ((type.contains("dgx-h100") || type.contains("dgx-h200") || type.contains("b300-x4")) && hasMultipleGPUs)
         {
             // Not a hard requirement, but based on empirical values.
             memorySize = "${gpuCount * 150}" + "Gi"
             storageSize = "${gpuCount * 150}" + "Gi"
             cpuCount = "${gpuCount * 12}"
         }

Also update the outdated comment that claims multi-GPU only supports DGX-H100/H200.


667-671: Confirm B300 dynamic driver flashing support; add exception if needed.

B300 isn’t listed among GPU types that skip the driver_version selector. If B300 nodes lack nvidia.com/driver_version or dynamic flashing support, scheduling will fail. Consider adding B300 to this exception list.

Proposed change (only if B300 nodes don’t advertise driver_version):

-        if (type.contains("dgx-h100") || type.contains("dgx-h200") || type in ["b100-ts2", "gh200", "rtx-5080", "rtx-5090"]) {
+        if (type.contains("dgx-h100") || type.contains("dgx-h200") || type.contains("b300") || type in ["b100-ts2", "gh200", "rtx-5080", "rtx-5090"]) {

Please verify with your cluster labels before applying.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between d3c1e22 and c5f4874.

📒 Files selected for processing (2)
  • jenkins/L0_Test.groovy (3 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_b300.yml (1 hunks)
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)

2468-2471: Gating: DGX_B300 sign included.

Adding "DGX_B300" ensures the multi-GPU grouping picks up the new stage in phase 2. No issues.

Comment on lines 1849 to 1880
"DGX_B300-4_GPUs-PyTorch-1": ["b300-x4", "l0_dgx_b300", 1, 1, 4],
]

💡 Verification agent

🧩 Analysis chain

Stage wiring for DGX_B300 looks correct, pending test-db context fix.

The entry uses ["b300-x4", "l0_dgx_b300", 1, 1, 4]. Once the YAML context is corrected to l0_dgx_b300, the test list should render and run on a 4-GPU B300 pod.

Run:


🏁 Script executed:

#!/bin/bash
rg -nC2 $'^\\s*\"DGX_B300-4_GPUs-PyTorch-1\": \\[\"b300-x4\", \"l0_dgx_b300\", 1, 1, 4\\],' jenkins/L0_Test.groovy

Length of output: 392


🏁 Script executed:

#!/bin/bash
# Search for L0 context definitions for DGX entries in YAML files
rg -n "l0_dgx" -g "*.yml" -g "*.yaml"

Length of output: 241


Add missing test-db context for DGX_B300

The Groovy entry in jenkins/L0_Test.groovy on lines 1847–1850 is correct:

  • "DGX_B300-4_GPUs-PyTorch-1": ["b300-x4", "l0_dgx_b300", 1, 1, 4],

However, there’s no corresponding YAML context file for l0_dgx_b300 under tests/integration/test_lists/test-db. You’ll need to add it so that the test list renders and runs on a 4-GPU B300 pod.

Please create:

  • tests/integration/test_lists/test-db/l0_dgx_b300.yml

with the appropriate L0 configuration (mirroring the structure of the existing H200 and B200 contexts). After adding that file, the DGX_B300 stage will execute as intended.

🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 1849-1850, the test matrix references
"l0_dgx_b300" but no YAML context exists; create
tests/integration/test_lists/test-db/l0_dgx_b300.yml modeled on the existing
H200/B200 context files: set the top-level name to l0_dgx_b300, include the same
keys used by other L0 contexts (platform/label for b300-x4, test_list or tests
to run, node counts, GPU/device_count set to 4, any timeout/parallelism and
env/settings matching L0 conventions), and ensure the file structure and keys
match exactly so the Jenkins stage can load the context and run the 4‑GPU B300
tests.

@yiqingy0
Collaborator Author

/bot run --stage-list "B300_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #16018 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16013 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #16018 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12040 (Partly Tested) completed with status: 'FAILURE'

@yiqingy0
Collaborator Author

/bot run --stage-list "B300_PCIe-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #16041 [ run ] triggered by Bot

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)

1846-1847: Stage wiring for DGX_B300 looks good, but ensure the test-db context exists and matches.

  • The stage references l0_dgx_b300. Ensure the YAML exists and its top-level key is l0_dgx_b300: (not l0_b200:). Without that, the rendered list will be empty.

Run:

#!/bin/bash
fd -a 'l0_dgx_b300.yml' tests
rg -n -C2 '^\s*l0_dgx_b300\s*:' tests/integration/test_lists/test-db/l0_dgx_b300.yml

Also note the PR comments used /bot run --stage-list "B300_PCIe-PyTorch-1", but there is no such stage here. Either:

  • Add a matching single-GPU B300 PCIe stage, or
  • Re-run with the new name "DGX_B300-4_GPUs-PyTorch-1".

To confirm, list current B300-related stages:

#!/bin/bash
rg -nP '^\s*"B300_PCIe-[^"]+":|\bDGX_B300-[^"]+":' jenkins/L0_Test.groovy

If you want the bot-friendly alias Stage as well, we can add it once the correct platform key for single-GPU B300 is known (e.g., b300-cr or similar). I can prepare that patch once you confirm the exact KubernetesManager mapping.

🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)

2465-2468: Including DGX_B300 in the phase-2 DGX gating set is correct.

This ensures B300 multi-GPU stages run in the DGX phase. No issues spotted.

Optionally, derive dgxSigns from the keys in x86TestConfigs matching ^DGX_ to avoid future drift when adding new DGX variants.


649-656: Optional: Parameterize multi-GPU resource scaling

Verified that the current defaults for DGX-H100 and DGX-H200 nodes safely fit within their capacities:

  • System RAM: nodes expose 2 TiB of host memory; 8×150 GiB = 1,200 GiB (~1.17 TiB) leaves ~800 GiB free (docs.nvidia.com, nvidia.com)
  • Ephemeral-storage: 2×1.92 TB OS NVMe + 8×3.84 TB data NVMe (~30.7 TiB total); 8×150 GiB = 1,200 GiB (~1.17 TiB) is well under capacity (docs.nvidia.com, nvidia.com)
  • CPU: 112 physical cores (2× Intel® Xeon® Platinum 8480C); 8×12 cores = 96 cores (nvidia.com)

Even though these defaults are safe on current GPU node types, making the per-GPU multipliers configurable allows node-type- or pool-specific tuning in the future without code changes.

Suggested change:

         if (hasMultipleGPUs)
         {
-            // Not a hard requirement, but based on empirical values.
-            memorySize = "${gpuCount * 150}" + "Gi"
-            storageSize = "${gpuCount * 150}" + "Gi"
-            cpuCount = "${gpuCount * 12}"
+            // Allow pool-specific tuning via env, with sane defaults.
+            def memPerGpuGi     = ((env.memPerGpuGi     ?: '150') as Integer)
+            def storagePerGpuGi = ((env.storagePerGpuGi ?: '150') as Integer)
+            def cpuPerGpu        = ((env.cpuPerGpu        ?: '12')  as Integer)
+            memorySize  = "${gpuCount * memPerGpuGi}Gi"
+            storageSize = "${gpuCount * storagePerGpuGi}Gi"
+            cpuCount     = "${gpuCount * cpuPerGpu}"
         }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 236aec8 and 061a9bb.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (4 hunks)
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)

664-669: Verify DGX_B300 “no dynamic driver flashing” selector

I confirmed that DGX_B300 is defined in jenkins/L0_Test.groovy (line 1846) alongside other DGX variants, and that the selector logic around lines 664–672 excludes certain types (dgx-h100, dgx-h200, b100-ts2, gh200, rtx-5080, rtx-5090) from driver_version pinning to avoid dynamic flashing mismatches.

– File: jenkins/L0_Test.groovy
• DGX_B300 entry: line 1846
• Selector block: lines 664–672

Suggested diff to include DGX_B300 in the exclusion list:

-        if (type.contains("dgx-h100") || type.contains("dgx-h200") || type in ["b100-ts2", "gh200", "rtx-5080", "rtx-5090"]) {
+        if (type.contains("dgx-h100") || type.contains("dgx-h200") || type.contains("dgx-b300") || type in ["b100-ts2", "gh200", "rtx-5080", "rtx-5090"]) {

Please confirm whether your DGX_B300 nodes require fixed drivers (i.e., do not support dynamic flashing). If so, this change will prevent unintended nvidia.com/driver_version label pinning and reduce scheduling conflicts.

@@ -1,4 +1,4 @@
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
@Library(['bloom-jenkins-shared-lib@user/yiqingy/dgx_b300', 'trtllm-jenkins-shared-lib@main']) _

💡 Verification agent

🧩 Analysis chain

Do not merge with a forked shared library reference; pin or revert to main.

Loading the shared lib from a user fork (@user/yiqingy/dgx_b300) is risky for mainline: it’s mutable, can disappear, and bypasses normal review. Either pin to an immutable commit SHA in the official lib or revert to @main before merge.

Proposed fix:

-@Library(['bloom-jenkins-shared-lib@user/yiqingy/dgx_b300', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _

If you need the B300 mapping from the fork, upstream it to the main shared lib and then reference @main.
Run to confirm no stray fork refs remain:


🏁 Script executed:

#!/bin/bash
rg -n "bloom-jenkins-shared-lib@user" -C2

Length of output: 252


Pin or revert the forked shared library reference before merging

The ripgrep check confirms the only occurrence of the forked library reference is in jenkins/L0_Test.groovy (line 1). To avoid depending on a mutable fork, update the annotation to use the official @main branch.

• jenkins/L0_Test.groovy, line 1

-@Library(['bloom-jenkins-shared-lib@user/yiqingy/dgx_b300', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main',    'trtllm-jenkins-shared-lib@main']) _

After updating, rerun:

rg -n "bloom-jenkins-shared-lib@user" -C2

to verify no remaining fork references.

🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around line 1, the @Library annotation references a
forked shared library ('bloom-jenkins-shared-lib@user/yiqingy/dgx_b300'); update
this to point to the official branch (e.g. 'bloom-jenkins-shared-lib@main') to
avoid depending on a mutable fork, commit the change, and then rerun the ripgrep
check (rg -n "bloom-jenkins-shared-lib@user" -C2) to confirm no remaining fork
references.

@tensorrt-cicd
Collaborator

PR_Github #16041 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12062 (Partly Tested) completed with status: 'FAILURE'

@yiqingy0
Collaborator Author

/bot run --stage-list "DGX_B300-4_GPUs-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #16101 [ run ] triggered by Bot

@yiqingy0
Collaborator Author

/bot run --stage-list "DGX_B300-4_GPUs-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #16106 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16101 [ run ] completed with state ABORTED

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/integration/test_lists/test-db/l0_dgx_b300.yml (1)

2-2: Context key name looks right; double-check end-to-end wiring.

l0_dgx_b300 matches the context name referenced in the Jenkins mapping per the PR summary. Please re-verify that the mapping points to this exact key and that the pod label token matches your wildcards.

Use this quick wiring check:

#!/bin/bash
set -euo pipefail

# Mapping contains the context key
rg -nC2 $'\"DGX_B300-4_GPUs-PyTorch-1\"\\s*:\\s*\\[.*\\b(l0_dgx_b300)\\b' jenkins/L0_Test.groovy

# YAML defines the context key
rg -n $'^\\s*l0_dgx_b300:' tests/integration/test_lists/test-db/l0_dgx_b300.yml
🧹 Nitpick comments (1)
tests/integration/test_lists/test-db/l0_dgx_b300.yml (1)

11-11: Review linux_distribution_name wildcard for consistency with CI Ubuntu version
We’ve observed that tests under tests/integration/test_lists/test-db/ (including l0_dgx_b300.yml at line 11) all use

linux_distribution_name: ubuntu*

which will match any Ubuntu release. To guard against accidental drift when newer Ubuntu versions land in CI, please verify the exact Ubuntu version your CI agents are running (e.g., 22.04). If they are pinned to Ubuntu 22.04, consider tightening the pattern as follows:

-      linux_distribution_name: ubuntu*
+      linux_distribution_name: ubuntu22.*

• File: tests/integration/test_lists/test-db/l0_dgx_b300.yml
• Line: 11
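The difference between the two patterns can be sanity-checked locally. This is a sketch that assumes the test-db matcher uses shell-style fnmatch semantics; the version strings below are illustrative, not taken from the CI inventory:

```python
# Sketch: behavior of the two wildcard patterns under shell-style matching.
# Assumes fnmatch-like semantics in the test-db matcher; the distro strings
# are made up for illustration.
from fnmatch import fnmatch

names = ["ubuntu20.04", "ubuntu22.04", "ubuntu24.04"]

loose = [n for n in names if fnmatch(n, "ubuntu*")]
tight = [n for n in names if fnmatch(n, "ubuntu22.*")]

print(loose)  # all three releases match the loose pattern
print(tight)  # only the 22.x release matches the tightened pattern
```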

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 061a9bb and 38376a4.

📒 Files selected for processing (1)
  • tests/integration/test_lists/test-db/l0_dgx_b300.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/integration/test_lists/test-db/l0_dgx_b300.yml (3)

3-7: GPU count filter is correct for a 4‑GPU stage.

system_gpu_count: gte: 4 / lte: 4 aligns with the intended "DGX_B300-4_GPUs-PyTorch-1" job selection. No change needed.


1-1: All test-db YAML files consistently use schema version 0.0.1

I’ve confirmed via ripgrep that every file under tests/integration/test_lists/test-db/ (including l0_dgx_b300.yml) declares version: 0.0.1, matching the rest of the suite. No alignment changes are needed.


15-17: Multi-GPU deepseek test selector is valid and includes 4-GPU case

  • The directory tests/unittest/_torch/multi_gpu_modeling exists and contains test_deepseek.py and test_llama4.py.
  • In test_deepseek.py, the test_deepseek_streaming function is parametrized with tp_size values [1, 4], so running
    pytest unittest/_torch/multi_gpu_modeling -k "deepseek"
    will execute both the single-GPU and the 4-GPU variants (skipping only if fewer than 4 GPUs are available).

No further changes needed.
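The skip behavior described above can be sketched as follows. Note this is a hedged illustration, not the actual test: should_run() is a hypothetical helper standing in for the real check against torch.cuda.device_count() in test_deepseek.py.

```python
# Hedged sketch of the tp_size-parametrized skip pattern described above.
# should_run() is a hypothetical stand-in for the GPU-count check the real
# test performs via torch.cuda.device_count().
import pytest

def should_run(tp_size: int, available_gpus: int) -> bool:
    """True when the host has at least tp_size GPUs."""
    return available_gpus >= tp_size

@pytest.mark.parametrize("tp_size", [1, 4])
def test_deepseek_streaming_sketch(tp_size):
    available = 4  # e.g., a DGX B300 x4 node
    if not should_run(tp_size, available):
        pytest.skip(f"needs {tp_size} GPUs, found {available}")
    # ... the actual model test would run here ...
```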

Comment on lines +8 to +11
wildcards:
  gpu:
  - '*gb110*'
  linux_distribution_name: ubuntu*

💡 Verification agent

🧩 Analysis chain

GPU wildcard likely mismatched; will filter out DGX_B300 nodes.

Using '*gb110*' under wildcards.gpu probably won’t match the infra label used by the Jenkins mapping (e.g., "dgx-b300-x4" or a *b300* pattern). Result: no nodes selected, zero tests scheduled.

Suggest matching the platform label and/or covering both patterns:

     wildcards:
       gpu:
-      - '*gb110*'
+      - '*b300*'
+      - '*gb200*'   # optional: if your inventory labels GPUs by GB200

If you prefer to be strict, keep only the one that your cluster actually advertises.

Run this to confirm what the pipeline expects vs what your YAML matches:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Verify the Jenkins stage→context mapping and the pod label token
rg -nC2 $'DGX_B300-4_GPUs-PyTorch-1' jenkins/L0_Test.groovy

# 2) Inspect any B300-related labels in the repo (pod templates, docs, etc.)
rg -n -C3 --hidden --glob '!**/.git/**' -S '(b300|dgx-b300|gb110|gb200)'

# 3) Spot-check other test-db entries to learn the convention for GPU wildcards
rg -nU -C2 $'^\\s*wildcards:\\s*\\n\\s*gpu:' tests/integration/test_lists/test-db

Length of output: 517


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1. List DGX-related YAMLs in test-db
ls tests/integration/test_lists/test-db | grep l0_dgx

# 2. Show the wildcard block in l0_dgx_b300.yml
sed -n '1,200p' tests/integration/test_lists/test-db/l0_dgx_b300.yml | sed '/wildcards:/,/^[^ ]/!d'

# 3. If present, show the wildcard block in l0_dgx_h200.yml for comparison
if [ -f tests/integration/test_lists/test-db/l0_dgx_h200.yml ]; then
  sed -n '1,200p' tests/integration/test_lists/test-db/l0_dgx_h200.yml | sed '/wildcards:/,/^[^ ]/!d'
fi

Length of output: 411


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Jenkins mapping for DGX_B300 ==="
grep -R "DGX_B300-4_GPUs-PyTorch-1" -n jenkins/L0_Test.groovy || echo "No match found"

echo -e "\n=== GPU wildcards in l0_dgx_b300.yml ==="
grep -R -n "wildcards:" -A5 tests/integration/test_lists/test-db/l0_dgx_b300.yml || echo "No wildcards block found"

echo -e "\n=== GPU wildcards in other DGX YAMLs ==="
for file in tests/integration/test_lists/test-db/l0_dgx_b200.yml \
             tests/integration/test_lists/test-db/l0_dgx_h100.yml \
             tests/integration/test_lists/test-db/l0_dgx_h200.yml; do
  echo "File: $file"
  grep -R -n "wildcards:" -A5 "$file" || echo "  No wildcards block found"
  echo
done

Length of output: 3045


Fix GPU wildcard in l0_dgx_b300.yml

The current GPU wildcard '*gb110*' does not match the node label "dgx-b300-x4" used by Jenkins (see mapping in jenkins/L0_Test.groovy). As a result, no nodes are selected and tests never run.

Update the wildcard to align with other DGX entries:

     wildcards:
       gpu:
-      - '*gb110*'
+      - '*b300*'

• File: tests/integration/test_lists/test-db/l0_dgx_b300.yml, lines 8–11
• Confirm that the Jenkins mapping
"DGX_B300-4_GPUs-PyTorch-1": ["dgx-b300-x4", "l0_dgx_b300", …]
matches the '*b300*' pattern.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

     wildcards:
       gpu:
-      - '*gb110*'
+      - '*b300*'
       linux_distribution_name: ubuntu*
🤖 Prompt for AI Agents
In tests/integration/test_lists/test-db/l0_dgx_b300.yml around lines 8 to 11 the
GPU wildcard '*gb110*' does not match the Jenkins node label "dgx-b300-x4",
causing no nodes to be selected; update the wildcard to '*b300*' so it matches
the Jenkins mapping (e.g. "DGX_B300-4_GPUs-PyTorch-1": ["dgx-b300-x4",
"l0_dgx_b300", …]) and verify the file now selects the intended DGX B300 nodes.
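The claimed mismatch can also be checked locally, assuming the test-db matcher applies fnmatch-style wildcards to the node label:

```python
# Sanity check of the review's claim, assuming fnmatch-style wildcard
# matching against the Jenkins node label.
from fnmatch import fnmatch

label = "dgx-b300-x4"
print(fnmatch(label, "*gb110*"))  # False -> node filtered out, no tests run
print(fnmatch(label, "*b300*"))   # True  -> node selected
```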

Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)

1-1: Do not merge with forked shared library; pin to official branch/commit.

Referencing a user fork in @Library is a supply-chain and stability risk for mainline. Upstream the B300 changes to the official shared lib and point to @main (or an immutable commit SHA) before merge.

Apply this diff:

-@Library(['bloom-jenkins-shared-lib@user/yiqingy/dgx_b300', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _

Run to confirm no stray fork refs remain:

#!/bin/bash
rg -n "bloom-jenkins-shared-lib@user" -C2 || true
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

684-686: Add a safe fallback if selectGPU lacks a mapping for new GPU aliases

If KubernetesManager.selectGPU(type) doesn’t yet know dgx-b300-x4, pod scheduling will fail. Add a minimal fallback to strip the -xN suffix and use the base GPU type. This is non-invasive and improves robustness while you upstream the mapping to the shared lib.

Apply this diff:

-        def gpuType = KubernetesManager.selectGPU(type)
+        def gpuType = KubernetesManager.selectGPU(type)
+        if (!gpuType) {
+            // Fallback: map "<gpu>-xN" -> "<gpu>" so scheduling can proceed
+            gpuType = (type ?: "").replaceAll(/-x\d+$/, "")
+        }

If you prefer not to change behavior here, confirm the shared lib already maps dgx-b300-x4 to dgx-b300 so this path is never hit.
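The suggested fallback regex can be exercised outside Groovy; this Python mirror (illustrative only, not part of the pipeline) shows the "-xN" suffix stripping:

```python
# Illustrative Python mirror of the suggested Groovy fallback:
# map "<gpu>-xN" -> "<gpu>" by stripping a trailing "-x<digits>".
import re

def fallback_gpu_type(type_label: str) -> str:
    return re.sub(r"-x\d+$", "", type_label or "")

print(fallback_gpu_type("dgx-b300-x4"))  # dgx-b300
print(fallback_gpu_type("dgx-b300"))     # unchanged: dgx-b300
```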

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 612733a and 0fc4f5b.

📒 Files selected for processing (2)
  • jenkins/L0_Test.groovy (3 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_b300.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/test_lists/test-db/l0_dgx_b300.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)

689-695: Add DGX B300 to “no dynamic driver flashing” list — LGTM

Including type.contains("dgx-b300") ensures we select by nvidia.com/gpu_type only (no driver pin), which aligns with other DGX classes. This should prevent scheduling attempts that require on-the-fly driver flashing on DGX B300 hosts.

Please confirm the DGX B300 nodes expose the nvidia.com/gpu_type label that matches the value returned by KubernetesManager.selectGPU("dgx-b300-x4") (typically dgx-b300). If not, jobs will remain unschedulable.

Comment on lines +1879 to 1880
"DGX_B300-4_GPUs-PyTorch-1": ["dgx-b300-x4", "l0_dgx_b300", 1, 1, 4],
]

💡 Verification agent

🧩 Analysis chain

Stage entry looks correct; verify matching test-db context and constraints exist

The new stage wiring:

  • Name: DGX_B300-4_GPUs-PyTorch-1
  • Platform: dgx-b300-x4
  • Test context: l0_dgx_b300
  • GPU count: 4

Ensure the test-db contains a matching context with correct root key and constraints; otherwise the rendered test list will be empty and the stage will no-op/fail.

Run:

If you’d like, I can draft a tests/integration/test_lists/test-db/l0_dgx_b300.yml modeled on existing DGX contexts (H200/H100), with system_gpu_count: 4, stage: pre_merge, and PyTorch filters.


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# 1) Context file exists
fd -a 'l0_dgx_b300.yml' tests || true
# 2) Root key matches the context name
rg -n -C2 $'^\s*l0_dgx_b300:' tests/integration/test_lists/test-db/l0_dgx_b300.yml || true
# 3) Basic constraints sanity (tune as needed)
rg -n -C1 'system_gpu_count\s*:\s*4' tests/integration/test_lists/test-db/l0_dgx_b300.yml || true
rg -n -C1 'stage\s*:\s*pre_merge' tests/integration/test_lists/test-db/l0_dgx_b300.yml || true
rg -n -C1 'backend\s*:\s*pytorch' tests/integration/test_lists/test-db/l0_dgx_b300.yml || true

Length of output: 673


Missing GPU count constraint in test-db context

The context file tests/integration/test_lists/test-db/l0_dgx_b300.yml is present and the root key l0_dgx_b300: is correct. The stage: pre_merge and backend: pytorch constraints are also defined. However, there is no system_gpu_count: 4 term, so this context won’t match the new stage and the pipeline will no-op or fail.

Please add the GPU count constraint under the terms: section. For example:

  terms:
    stage: pre_merge
+   system_gpu_count: 4
    backend: pytorch
📝 Committable suggestion


Suggested change

-"DGX_B300-4_GPUs-PyTorch-1": ["dgx-b300-x4", "l0_dgx_b300", 1, 1, 4],
-]
+terms:
+  stage: pre_merge
+  system_gpu_count: 4
+  backend: pytorch
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 1879-1880, the mapping references
l0_dgx_b300 but the test-db context file
tests/integration/test_lists/test-db/l0_dgx_b300.yml lacks the GPU-count
constraint; open that YAML and under the l0_dgx_b300: -> terms: section add
system_gpu_count: 4 with correct indentation so the context matches the
DGX_B300-4_GPUs-PyTorch-1 entry and the stage/backend constraints.
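Pulling the review's fixes together, the complete context might look like the sketch below. This is hypothetical, assembled only from the constraints quoted in this thread (schema version, '*b300*' wildcard, pre_merge/pytorch terms, system_gpu_count, and the deepseek selector); verify the exact nesting and keys against the actual file before committing:

```yaml
# Hypothetical sketch of tests/integration/test_lists/test-db/l0_dgx_b300.yml
# after applying the review's suggestions; nesting is an assumption modeled
# on the snippets quoted above.
version: 0.0.1
l0_dgx_b300:
- condition:
    terms:
      stage: pre_merge
      system_gpu_count: 4
      backend: pytorch
    wildcards:
      gpu:
      - '*b300*'
      linux_distribution_name: ubuntu*
  tests:
  - unittest/_torch/multi_gpu_modeling -k "deepseek"
```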
