[TRTLLM-6139][infra] Add b300 tests #6986
Conversation
📝 Walkthrough

Adds DGX_B300 4-GPU test entries to the Jenkins L0 test matrix and updates pod GPU-type handling; introduces a new test-db YAML defining the l0_dgx_b300 pre-merge PyTorch multi-GPU test. Also changes a library load path to a forked bloom shared-lib reference.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Jenkins as Jenkins L0 pipeline
    participant Matrix as launchTestJobs
    participant Cfg as createKubernetesPodConfig
    participant Kube as Kubernetes
    participant TestDB as Test DB (l0_dgx_b300.yml)
    Jenkins->>Matrix: build test matrix (includes DGX_B300-4_GPUs-PyTorch-1)
    Matrix->>Cfg: request pod config (arch, gpuType, gpuCount, image...)
    Cfg-->>Matrix: return pod spec (gpu-type check includes dgx-b300)
    Matrix->>Kube: launch job(s) with pod spec
    Kube->>TestDB: query tests for matching job (loads l0_dgx_b300.yml)
    Kube->>Kube: run tests (PyTorch multi-GPU)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/integration/test_lists/test-db/l0_b300.yml (1)

Lines 5-7: Double-check GPU count gate vs. selected tests

The condition gates to exactly one GPU (gte: 1, lte: 1), while the selected test is from `multi_gpu_modeling`. If that test expects more than one GPU on B300 nodes, this stage will systematically skip/xfail or under-exercise the path. If needed, relax the GPU gate (e.g., gte: 2 for true multi-GPU) or pick a single-GPU-friendly sanity test for the initial B300 bring-up. Want a suggested minimal single-GPU PyTorch test list?

Also applies to: lines 15-17
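The relaxed gate suggested above could look like the following; this is a hypothetical sketch of the condition block, assuming the gte/lte range schema that the other l0_*.yml contexts in this review use (the exact key layout is not verified here):

```yaml
# Hypothetical sketch: gate the stage to 2-4 GPUs for a true multi-GPU run
condition:
  ranges:
    system_gpu_count:
      gte: 2
      lte: 4
```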
jenkins/L0_Test.groovy (1)

Lines 657-679: Consider centralizing mapping and verify driver-flashing constraints for B300

Two follow-ups to keep things robust:
- Prefer normalizing `type` to lowercase before checks to avoid case drift across call sites.
- Verify whether B300 belongs in the "no dynamic driver flashing" bucket (the branch that omits `nvidia.com/driver_version`). If it does, include it alongside the existing types to avoid scheduling mismatches.

Proposed tweaks:

```diff
-    def gpuType = KubernetesManager.selectGPU(type)
-    if (type.contains("b300")) {
+    def normalizedType = type?.toLowerCase()
+    def gpuType = KubernetesManager.selectGPU(normalizedType)
+    if (normalizedType.contains("b300")) {
         gpuType = "NVIDIA_HGX_B300"
     }
@@
-    if (type.contains("dgx-h100") || type.contains("dgx-h200") || type in ["b100-ts2", "gh200", "rtx-5080", "rtx-5090"]) {
+    if (normalizedType.contains("dgx-h100") || normalizedType.contains("dgx-h200") ||
+        normalizedType in ["b100-ts2", "gh200", "rtx-5080", "rtx-5090"] ||
+        normalizedType.contains("b300") /* if B300 should avoid driver flashing */) {
```

If the driver-flashing rule doesn't apply to B300, skip the last addition and keep only the normalization.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- jenkins/L0_Test.groovy (2 hunks)
- tests/integration/test_lists/test-db/l0_b300.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)

tests/integration/test_lists/test-db/l0_b300.yml (1)

Lines 8-11: linux_distribution_name placement is correct

Verified across all tests/integration/test_lists/test-db entries (e.g., l0_h100.yml, l0_perf.yml) that `linux_distribution_name` is always nested under `wildcards`. No change required.

jenkins/L0_Test.groovy (1)

Lines 657-659: LGTM: Explicit B300 gpuType override

Hard-setting B300 to "NVIDIA_HGX_B300" avoids mis-selection by the generic mapper. This is a pragmatic guardrail for early enablement.
jenkins/L0_Test.groovy (Outdated)

```groovy
"B300_PCIe-PyTorch-1": ["b300", "l0_b300", 1, 1],
]
```
💡 Verification agent
🧩 Analysis chain
Stage wiring looks correct, but it depends on the test-db context fix
The new entry uses ["b300", "l0_b300", 1, 1], which aligns with other single-GPU entries and the naming convention used by filters (GPU token: B300_PCIe). However, the test-db file currently defines the context as l0_b200. Without fixing that, this stage will render an empty list.
After applying the YAML fix, run this to ensure the context is discoverable:
🏁 Script executed:
#!/bin/bash
# Verify the l0_b300 context exists and is referenced
fd -a 'l0_b300.yml'
rg -n -C2 '^\s*l0_b300:' tests/integration/test_lists/test-db/l0_b300.yml
Length of output: 163
Define matching l0_b300 context in test-db

The new stage entry uses GPU context l0_b300, but tests/integration/test_lists/test-db/l0_b300.yml does not define that key (it still defines l0_b200:), so the stage will render an empty list.

- In tests/integration/test_lists/test-db/l0_b300.yml, update the root key from l0_b200: to l0_b300:.
- After updating, verify with:

```bash
fd -a l0_b300.yml
rg -n -C2 '^\s*l0_b300:' tests/integration/test_lists/test-db/l0_b300.yml
```
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 1846-1847, the new stage references GPU
context "l0_b300" but the test-db file
tests/integration/test_lists/test-db/l0_b300.yml still defines the root key as
"l0_b200", so the stage will render an empty list; update the root key in
tests/integration/test_lists/test-db/l0_b300.yml from "l0_b200:" to "l0_b300:"
and save, then verify the change exists (for example run fd -a l0_b300.yml and
rg -n -C2 '^\s*l0_b300:' tests/integration/test_lists/test-db/l0_b300.yml).
```diff
@@ -0,0 +1,17 @@
+version: 0.0.1
+l0_b200:
```
Blocking: Context key mismatch will yield an empty test list

The top-level key is l0_b200, but the file name and Jenkins job use l0_b300. The test-db renderer looks up the context by name; this mismatch will cause no tests to be selected for B300. Please rename the key to match the context.

Apply this diff:
Apply this diff:
-version: 0.0.1
-l0_b200:
+version: 0.0.1
+l0_b300:
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
version: 0.0.1
l0_b300:
```
🤖 Prompt for AI Agents
In tests/integration/test_lists/test-db/l0_b300.yml around line 2, the top-level
context key is mistakenly set to "l0_b200" which does not match the filename and
Jenkins job "l0_b300"; rename the top-level key from "l0_b200" to "l0_b300" so
the test-db renderer can find the context and select the correct tests (ensure
the key exactly matches the file/job name).
/bot run --stage-list "B300_PCIe-PyTorch-1"
PR_Github #15739 [ run ] triggered by Bot
PR_Github #15739 [ run ] completed with state
/bot run --stage-list "B300_PCIe-PyTorch-1"
PR_Github #16008 [ run ] triggered by Bot
/bot run --stage-list "B300_PCIe-PyTorch-1"
PR_Github #16013 [ run ] triggered by Bot
PR_Github #16008 [ run ] completed with state
Actionable comments posted: 3
♻️ Duplicate comments (1)
tests/integration/test_lists/test-db/l0_dgx_b300.yml (1)

Line 2: Context key mismatch: l0_b200 vs. expected l0_dgx_b300 (breaks test DB resolution).

The Jenkins stage maps to context "l0_dgx_b300", but this YAML defines "l0_b200". This will render an empty test list. Fix the root key to match the stage mapping.

Apply this diff:

```diff
-version: 0.0.1
-l0_b200:
+version: 0.0.1
+l0_dgx_b300:
```
🧹 Nitpick comments (4)

tests/integration/test_lists/test-db/l0_dgx_b300.yml (2)

Lines 9-11: Make GPU wildcard consistent with chip naming (case).

If your sysinfo reports GPU names with uppercase prefixes (e.g., "NVIDIA_HGX_B300"), a lowercase wildcard may not match depending on DB processing. Prefer an uppercase wildcard to be safe, consistent with other contexts.

Apply this diff:

```diff
   gpu:
-  - '*b300*'
+  - '*B300*'
```

Lines 16-17: Quote the test line to avoid YAML parsing quirks.

YAML will treat the entire scalar after '-' as a single value, but quoting avoids surprises with the inner -k and quotes.

Apply this diff:

```diff
-  - unittest/_torch/multi_gpu_modeling -k "deepseek"
+  - "unittest/_torch/multi_gpu_modeling -k deepseek"
```

jenkins/L0_Test.groovy (2)
Lines 658-661: Hardcoded B300 GPU type override is fine; also scale resources for B300 multi-GPU.

The override to "NVIDIA_HGX_B300" ensures correct node selection. However, multi-GPU resource scaling currently only applies to DGX-H100/H200, which may under-provision memory/CPU for 4x B300 jobs.

Apply this diff to include B300 in the scaling clause:

```diff
-    // Multi-GPU only supports DGX-H100 and DGX-H200 due to the hardware stability.
-    if ((type.contains("dgx-h100") || type.contains("dgx-h200")) && hasMultipleGPUs)
+    // Multi-GPU scaling: include DGX-H100/H200 and B300.
+    if ((type.contains("dgx-h100") || type.contains("dgx-h200") || type.contains("b300-x4")) && hasMultipleGPUs)
     {
         // Not a hard requirement, but based on empirical values.
         memorySize = "${gpuCount * 150}" + "Gi"
         storageSize = "${gpuCount * 150}" + "Gi"
         cpuCount = "${gpuCount * 12}"
     }
```

Also update the outdated comment that claims multi-GPU only supports DGX-H100/H200.
Lines 667-671: Confirm B300 dynamic driver flashing support; add exception if needed.

B300 isn't listed among GPU types that skip the driver_version selector. If B300 nodes lack nvidia.com/driver_version or dynamic flashing support, scheduling will fail. Consider adding B300 to this exception list.

Proposed change (only if B300 nodes don't advertise driver_version):

```diff
-    if (type.contains("dgx-h100") || type.contains("dgx-h200") || type in ["b100-ts2", "gh200", "rtx-5080", "rtx-5090"]) {
+    if (type.contains("dgx-h100") || type.contains("dgx-h200") || type.contains("b300") || type in ["b100-ts2", "gh200", "rtx-5080", "rtx-5090"]) {
```

Please verify with your cluster labels before applying.
📒 Files selected for processing (2)
- jenkins/L0_Test.groovy (3 hunks)
- tests/integration/test_lists/test-db/l0_dgx_b300.yml (1 hunks)
🔇 Additional comments (1)

jenkins/L0_Test.groovy (1)

Lines 2468-2471: Gating: DGX_B300 sign included.

Adding "DGX_B300" ensures the multi-GPU grouping picks up the new stage in phase 2. No issues.
jenkins/L0_Test.groovy (Outdated)

```groovy
"DGX_B300-4_GPUs-PyTorch-1": ["b300-x4", "l0_dgx_b300", 1, 1, 4],
]
```
💡 Verification agent
🧩 Analysis chain
Stage wiring for DGX_B300 looks correct, pending test-db context fix.
The entry uses ["b300-x4", "l0_dgx_b300", 1, 1, 4]. Once the YAML context is corrected to l0_dgx_b300, the test list should render and run on a 4-GPU B300 pod.
Run:
🏁 Script executed:
#!/bin/bash
rg -nC2 $'^\\s*\"DGX_B300-4_GPUs-PyTorch-1\": \\[\"b300-x4\", \"l0_dgx_b300\", 1, 1, 4\\],' jenkins/L0_Test.groovy
Length of output: 392
🏁 Script executed:
#!/bin/bash
# Search for L0 context definitions for DGX entries in YAML files
rg -n "l0_dgx" -g "*.yml" -g "*.yaml"
Length of output: 241
Add missing test-db context for DGX_B300
The Groovy entry in jenkins/L0_Test.groovy on lines 1847–1850 is correct:
"DGX_B300-4_GPUs-PyTorch-1": ["b300-x4", "l0_dgx_b300", 1, 1, 4],
However, there's no corresponding YAML context file for l0_dgx_b300 under tests/integration/test_lists/test-db. You'll need to add it so that the test list renders and runs on a 4-GPU B300 pod.

Please create tests/integration/test_lists/test-db/l0_dgx_b300.yml with the appropriate L0 configuration (mirroring the structure of the existing H200 and B200 contexts). After adding that file, the DGX_B300 stage will execute as intended.
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 1849-1850, the test matrix references
"l0_dgx_b300" but no YAML context exists; create
tests/integration/test_lists/test-db/l0_dgx_b300.yml modeled on the existing
H200/B200 context files: set the top-level name to l0_dgx_b300, include the same
keys used by other L0 contexts (platform/label for b300-x4, test_list or tests
to run, node counts, GPU/device_count set to 4, any timeout/parallelism and
env/settings matching L0 conventions), and ensure the file structure and keys
match exactly so the Jenkins stage can load the context and run the 4‑GPU B300
tests.
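A context file along those lines might look like the following; this is a hypothetical sketch assembled from the fragments quoted elsewhere in this review (the gte/lte range, the wildcard keys, and the deepseek test line) and is not verified against the actual test-db schema:

```yaml
version: 0.0.1
l0_dgx_b300:
- condition:
    ranges:
      system_gpu_count:
        gte: 4
        lte: 4
    wildcards:
      gpu:
      - '*b300*'
      linux_distribution_name: ubuntu*
    terms:
      stage: pre_merge
      backend: pytorch
  tests:
  - unittest/_torch/multi_gpu_modeling -k "deepseek"
```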
/bot run --stage-list "B300_PCIe-PyTorch-1"
PR_Github #16018 [ run ] triggered by Bot
PR_Github #16013 [ run ] completed with state
PR_Github #16018 [ run ] completed with state
/bot run --stage-list "B300_PCIe-PyTorch-1"
PR_Github #16041 [ run ] triggered by Bot
Actionable comments posted: 1
♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)

Lines 1846-1847: Stage wiring for DGX_B300 looks good, but ensure the test-db context exists and matches.

The stage references l0_dgx_b300. Ensure the YAML exists and its top-level key is l0_dgx_b300: (not l0_b200:). Without that, the rendered list will be empty.

Run:

```bash
fd -a 'l0_dgx_b300.yml' tests
rg -n -C2 '^\s*l0_dgx_b300\s*:' tests/integration/test_lists/test-db/l0_dgx_b300.yml
```

Also note the PR comments used /bot run --stage-list "B300_PCIe-PyTorch-1", but there is no such stage here. Either:
- Add a matching single-GPU B300 PCIe stage, or
- Re-run with the new name "DGX_B300-4_GPUs-PyTorch-1".

To confirm, list current B300-related stages:

```bash
rg -nP '^\s*"B300_PCIe-[^"]+":|\bDGX_B300-[^"]+":' jenkins/L0_Test.groovy
```

If you want the bot-friendly alias stage as well, we can add it once the correct platform key for single-GPU B300 is known (e.g., b300-cr or similar). I can prepare that patch once you confirm the exact KubernetesManager mapping.
🧹 Nitpick comments (2)

jenkins/L0_Test.groovy (2)

Lines 2465-2468: Including DGX_B300 in the phase-2 DGX gating set is correct.

This ensures B300 multi-GPU stages run in the DGX phase. No issues spotted. Optionally, derive dgxSigns from the keys in x86TestConfigs matching ^DGX_ to avoid future drift when adding new DGX variants.

Lines 649-656: Optional: Parameterize multi-GPU resource scaling

Verified that the current defaults for DGX-H100 and DGX-H200 nodes safely fit within their capacities:
- System RAM: nodes expose 2 TiB of host memory; 8×150 GiB = 1,200 GiB (~1.17 TiB) leaves ~800 GiB free (docs.nvidia.com, nvidia.com)
- Ephemeral storage: 2×1.92 TB OS NVMe + 8×3.84 TB data NVMe (~30.7 TiB total); 8×150 GiB = 1,200 GiB (~1.17 TiB) is well under capacity (docs.nvidia.com, nvidia.com)
- CPU: 112 physical cores (2× Intel® Xeon® Platinum 8480C); 8×12 cores = 96 cores (nvidia.com)

Even though these defaults are safe on current GPU node types, making the per-GPU multipliers configurable ensures future node-type or pool-specific tuning without code changes.

Suggested change:

```diff
     if (hasMultipleGPUs) {
-        // Not a hard requirement, but based on empirical values.
-        memorySize = "${gpuCount * 150}" + "Gi"
-        storageSize = "${gpuCount * 150}" + "Gi"
-        cpuCount = "${gpuCount * 12}"
+        // Allow pool-specific tuning via env, with sane defaults.
+        def memPerGpuGi = ((env.memPerGpuGi ?: '150') as Integer)
+        def storagePerGpuGi = ((env.storagePerGpuGi ?: '150') as Integer)
+        def cpuPerGpu = ((env.cpuPerGpu ?: '12') as Integer)
+        memorySize = "${gpuCount * memPerGpuGi}Gi"
+        storageSize = "${gpuCount * storagePerGpuGi}Gi"
+        cpuCount = "${gpuCount * cpuPerGpu}"
     }
```
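The env-override pattern in that Groovy diff can be sketched in Python to make the arithmetic concrete; the 150 GiB and 12-core multipliers come from the review above, while the function itself is purely illustrative and not part of the pipeline:

```python
import os

def pod_resources(gpu_count, env=os.environ):
    """Compute pod resource requests, letting per-GPU multipliers be
    overridden via environment variables (defaults mirror the empirical
    values discussed in the review)."""
    mem_per_gpu = int(env.get("memPerGpuGi", "150"))
    storage_per_gpu = int(env.get("storagePerGpuGi", "150"))
    cpu_per_gpu = int(env.get("cpuPerGpu", "12"))
    return {
        "memory": f"{gpu_count * mem_per_gpu}Gi",
        "storage": f"{gpu_count * storage_per_gpu}Gi",
        "cpu": str(gpu_count * cpu_per_gpu),
    }

# With the defaults, a 4-GPU B300 pod would request 600Gi/600Gi/48 cores:
print(pod_resources(4, env={}))
```

The same defaults reproduce the 8-GPU numbers cited in the capacity check (1,200 GiB memory, 96 cores).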
📒 Files selected for processing (1)
- jenkins/L0_Test.groovy (4 hunks)
🔇 Additional comments (1)

jenkins/L0_Test.groovy (1)

Lines 664-669: Verify DGX_B300 "no dynamic driver flashing" selector

I confirmed that DGX_B300 is defined in jenkins/L0_Test.groovy (line 1846) alongside other DGX variants, and that the selector logic around lines 664-672 excludes certain types (dgx-h100, dgx-h200, b100-ts2, gh200, rtx-5080, rtx-5090) from driver_version pinning to avoid dynamic flashing mismatches.

- File: jenkins/L0_Test.groovy
  - DGX_B300 entry: line 1846
  - Selector block: lines 664-672

Suggested diff to include DGX_B300 in the exclusion list:

```diff
-    if (type.contains("dgx-h100") || type.contains("dgx-h200") || type in ["b100-ts2", "gh200", "rtx-5080", "rtx-5090"]) {
+    if (type.contains("dgx-h100") || type.contains("dgx-h200") || type.contains("dgx-b300") || type in ["b100-ts2", "gh200", "rtx-5080", "rtx-5090"]) {
```

Please confirm whether your DGX_B300 nodes require fixed drivers (i.e., do not support dynamic flashing). If so, this change will prevent unintended nvidia.com/driver_version label pinning and reduce scheduling conflicts.
```diff
@@ -1,4 +1,4 @@
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@user/yiqingy/dgx_b300', 'trtllm-jenkins-shared-lib@main']) _
```
💡 Verification agent
🧩 Analysis chain
Do not merge with a forked shared library reference; pin or revert to main.
Loading the shared lib from a user fork (@user/yiqingy/dgx_b300) is risky for mainline: it's mutable, can disappear, and bypasses normal review. Either pin to an immutable commit SHA in the official lib or revert to @main before merge.
Proposed fix:
-@Library(['bloom-jenkins-shared-lib@user/yiqingy/dgx_b300', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
If you need the B300 mapping from the fork, upstream it to the main shared lib and then reference @main.
Run to confirm no stray fork refs remain:
🏁 Script executed:
#!/bin/bash
rg -n "bloom-jenkins-shared-lib@user" -C2
Length of output: 252
Pin or revert the forked shared library reference before merging
The ripgrep check confirms the only occurrence of the forked library reference is in jenkins/L0_Test.groovy (line 1). To avoid depending on a mutable fork, update the annotation to use the official @main branch.
• jenkins/L0_Test.groovy, line 1
-@Library(['bloom-jenkins-shared-lib@user/yiqingy/dgx_b300', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
After updating, rerun:
rg -n "bloom-jenkins-shared-lib@user" -C2
to verify no remaining fork references.
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around line 1, the @Library annotation references a
forked shared library ('bloom-jenkins-shared-lib@user/yiqingy/dgx_b300'); update
this to point to the official branch (e.g. 'bloom-jenkins-shared-lib@main') to
avoid depending on a mutable fork, commit the change, and then rerun the ripgrep
check (rg -n "bloom-jenkins-shared-lib@user" -C2) to confirm no remaining fork
references.
PR_Github #16041 [ run ] completed with state
/bot run --stage-list "DGX_B300-4_GPUs-PyTorch-1"
PR_Github #16101 [ run ] triggered by Bot
/bot run --stage-list "DGX_B300-4_GPUs-PyTorch-1"
PR_Github #16106 [ run ] triggered by Bot
PR_Github #16101 [ run ] completed with state
Actionable comments posted: 1
♻️ Duplicate comments (1)

tests/integration/test_lists/test-db/l0_dgx_b300.yml (1)

Line 2: Context key name looks right; double-check end-to-end wiring.

l0_dgx_b300 matches the context name referenced in the Jenkins mapping per the PR summary. Please re-verify that the mapping points to this exact key and that the pod label token matches your wildcards.

Use this quick wiring check:

```bash
# Mapping contains the context key
rg -nC2 '"DGX_B300-4_GPUs-PyTorch-1"\s*:\s*\[.*\bl0_dgx_b300\b' jenkins/L0_Test.groovy
# YAML defines the context key
rg -n '^\s*l0_dgx_b300:' tests/integration/test_lists/test-db/l0_dgx_b300.yml
```
🧹 Nitpick comments (1)

tests/integration/test_lists/test-db/l0_dgx_b300.yml (1)

Line 11: Review linux_distribution_name wildcard for consistency with CI Ubuntu version

We've observed that tests under tests/integration/test_lists/test-db/ (including l0_dgx_b300.yml at line 11) all use `linux_distribution_name: ubuntu*`, which will match any Ubuntu release. To guard against accidental drift when newer Ubuntu versions land in CI, please verify the exact Ubuntu version your CI agents are running (e.g., 22.04). If they are pinned to Ubuntu 22.04, consider tightening the pattern as follows:

```diff
-  linux_distribution_name: ubuntu*
+  linux_distribution_name: ubuntu22.*
```

- File: tests/integration/test_lists/test-db/l0_dgx_b300.yml
- Line: 11
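The effect of tightening the wildcard is easy to check with shell-style globbing; this quick illustration uses Python's fnmatch, on the assumption that the test-db renderer matches wildcards the same way (which this review does not confirm):

```python
from fnmatch import fnmatch

# 'ubuntu*' matches any Ubuntu release string
assert fnmatch("ubuntu22.04", "ubuntu*")
assert fnmatch("ubuntu24.04", "ubuntu*")

# 'ubuntu22.*' pins the match to 22.x releases only
assert fnmatch("ubuntu22.04", "ubuntu22.*")
assert not fnmatch("ubuntu24.04", "ubuntu22.*")
```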
📒 Files selected for processing (1)
- tests/integration/test_lists/test-db/l0_dgx_b300.yml (1 hunks)
🔇 Additional comments (3)

tests/integration/test_lists/test-db/l0_dgx_b300.yml (3)

Lines 3-7: GPU count filter is correct for a 4-GPU stage.

system_gpu_count: gte: 4 / lte: 4 aligns with the intended "DGX_B300-4_GPUs-PyTorch-1" job selection. No change needed.

Line 1: All test-db YAML files consistently use schema version 0.0.1

I've confirmed via ripgrep that every file under tests/integration/test_lists/test-db/ (including l0_dgx_b300.yml) declares version: 0.0.1, matching the rest of the suite. No alignment changes are needed.

Lines 15-17: Multi-GPU deepseek test selector is valid and includes the 4-GPU case

- The directory tests/unittest/_torch/multi_gpu_modeling exists and contains test_deepseek.py and test_llama4.py.
- In test_deepseek.py, the test_deepseek_streaming function is parametrized with tp_size values [1, 4], so running

```bash
pytest unittest/_torch/multi_gpu_modeling -k "deepseek"
```

will execute both the single-GPU and the 4-GPU variants (skipping only if fewer than 4 GPUs are available).

No further changes needed.
```yaml
wildcards:
  gpu:
  - '*gb110*'
  linux_distribution_name: ubuntu*
```
💡 Verification agent
🧩 Analysis chain
GPU wildcard likely mismatched; will filter out DGX_B300 nodes.
Using '*gb110*' under wildcards.gpu probably won't match the infra label used by the Jenkins mapping (e.g., "dgx-b300-x4" or a *b300* pattern). The result: no nodes selected, zero tests scheduled.

Suggest matching the platform label and/or covering both patterns:

```diff
 wildcards:
   gpu:
-  - '*gb110*'
+  - '*b300*'
+  - '*gb200*'  # optional: if your inventory labels GPUs by GB200
```

If you prefer to be strict, keep only the one that your cluster actually advertises.

Run this to confirm what the pipeline expects vs. what your YAML matches:
Run this to confirm what the pipeline expects vs what your YAML matches:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Verify the Jenkins stage→context mapping and the pod label token
rg -nC2 $'DGX_B300-4_GPUs-PyTorch-1' jenkins/L0_Test.groovy
# 2) Inspect any B300-related labels in the repo (pod templates, docs, etc.)
rg -nCI3 --hidden --glob '!**/.git/**' '(b300|dgx-b300|gb110|gb200)' -S
# 3) Spot-check other test-db entries to learn the convention for GPU wildcards
rg -nC2 $'^\\s*wildcards:\\s*\\n\\s*gpu:' tests/integration/test_lists/test-db
Length of output: 517
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1. List DGX-related YAMLs in test-db
ls tests/integration/test_lists/test-db | grep l0_dgx
# 2. Show the wildcard block in l0_dgx_b300.yml
sed -n '1,200p' tests/integration/test_lists/test-db/l0_dgx_b300.yml | sed -n '/wildcards:/,/^[^ ]/!d'
# 3. If present, show the wildcard block in l0_dgx_h200.yml for comparison
if [ -f tests/integration/test_lists/test-db/l0_dgx_h200.yml ]; then
sed -n '1,200p' tests/integration/test_lists/test-db/l0_dgx_h200.yml | sed -n '/wildcards:/,/^[^ ]/!d'
fi
Length of output: 411
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Jenkins mapping for DGX_B300 ==="
grep -R "DGX_B300-4_GPUs-PyTorch-1" -n jenkins/L0_Test.groovy || echo "No match found"
echo -e "\n=== GPU wildcards in l0_dgx_b300.yml ==="
grep -R -n "wildcards:" -A5 tests/integration/test_lists/test-db/l0_dgx_b300.yml || echo "No wildcards block found"
echo -e "\n=== GPU wildcards in other DGX YAMLs ==="
for file in tests/integration/test_lists/test-db/l0_dgx_b200.yml \
tests/integration/test_lists/test-db/l0_dgx_h100.yml \
tests/integration/test_lists/test-db/l0_dgx_h200.yml; do
echo "File: $file"
grep -R -n "wildcards:" -A5 "$file" || echo " No wildcards block found"
echo
done
Length of output: 3045
Fix GPU wildcard in l0_dgx_b300.yml

The current GPU wildcard '*gb110*' does not match the node label "dgx-b300-x4" used by Jenkins (see the mapping in jenkins/L0_Test.groovy). As a result, no nodes are selected and tests never run.

Update the wildcard to align with other DGX entries:

```diff
 wildcards:
   gpu:
-  - '*gb110*'
+  - '*b300*'
```

- File: tests/integration/test_lists/test-db/l0_dgx_b300.yml, lines 8-11
- Confirm that the Jenkins mapping "DGX_B300-4_GPUs-PyTorch-1": ["dgx-b300-x4", "l0_dgx_b300", …] matches the '*b300*' pattern.
```yaml
wildcards:
  gpu:
  - '*b300*'
  linux_distribution_name: ubuntu*
```
🤖 Prompt for AI Agents
In tests/integration/test_lists/test-db/l0_dgx_b300.yml around lines 8 to 11 the
GPU wildcard '*gb110*' does not match the Jenkins node label "dgx-b300-x4",
causing no nodes to be selected; update the wildcard to '*b300*' so it matches
the Jenkins mapping (e.g. "DGX_B300-4_GPUs-PyTorch-1": ["dgx-b300-x4",
"l0_dgx_b300", …]) and verify the file now selects the intended DGX B300 nodes.
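The mismatch is easy to reproduce with shell-style globbing (assuming the test-db matcher behaves like Python's fnmatch, which the review does not confirm):

```python
from fnmatch import fnmatch

label = "dgx-b300-x4"  # node label from the Jenkins mapping

# The original wildcard never matches the B300 label...
assert not fnmatch(label, "*gb110*")
# ...while the suggested '*b300*' pattern does.
assert fnmatch(label, "*b300*")
```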
Signed-off-by: Yiqing Yan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
Actionable comments posted: 1
♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)

Line 1: Do not merge with forked shared library; pin to official branch/commit.

Referencing a user fork in @Library is a supply-chain and stability risk for mainline. Upstream the B300 changes to the official shared lib and point to @main (or an immutable commit SHA) before merge.

Apply this diff:

```diff
-@Library(['bloom-jenkins-shared-lib@user/yiqingy/dgx_b300', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
```

Run to confirm no stray fork refs remain:

```bash
rg -n "bloom-jenkins-shared-lib@user" -C2 || true
```
🧹 Nitpick comments (1)

jenkins/L0_Test.groovy (1)

Lines 684-686: Add a safe fallback if selectGPU lacks a mapping for new GPU aliases

If KubernetesManager.selectGPU(type) doesn't yet know dgx-b300-x4, pod scheduling will fail. Add a minimal fallback that strips the -xN suffix and uses the base GPU type. This is non-invasive and improves robustness while you upstream the mapping to the shared lib.

Apply this diff:

```diff
 def gpuType = KubernetesManager.selectGPU(type)
+if (!gpuType) {
+    // Fallback: map "<gpu>-xN" -> "<gpu>" so scheduling can proceed
+    gpuType = (type ?: "").replaceAll(/-x\d+$/, "")
+}
```

If you prefer not to change behavior here, confirm the shared lib already maps dgx-b300-x4 to dgx-b300 so this path is never hit.
📒 Files selected for processing (2)
- jenkins/L0_Test.groovy (3 hunks)
- tests/integration/test_lists/test-db/l0_dgx_b300.yml (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/test_lists/test-db/l0_dgx_b300.yml
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)
689-695
: Add DGX B300 to the “no dynamic driver flashing” list — LGTM

Including `type.contains("dgx-b300")` ensures we select by `nvidia.com/gpu_type` only (no driver pin), which aligns with the other DGX classes. This should prevent scheduling attempts that require on-the-fly driver flashing on DGX B300 hosts.

Please confirm the DGX B300 nodes expose an `nvidia.com/gpu_type` label that matches the value returned by `KubernetesManager.selectGPU("dgx-b300-x4")` (typically `dgx-b300`). If not, jobs will remain unschedulable.
```groovy
        "DGX_B300-4_GPUs-PyTorch-1": ["dgx-b300-x4", "l0_dgx_b300", 1, 1, 4],
]
```
💡 Verification agent
🧩 Analysis chain
Stage entry looks correct; verify that a matching test-db context and constraints exist.

The new stage wiring:

- Name: `DGX_B300-4_GPUs-PyTorch-1`
- Platform: `dgx-b300-x4`
- Test context: `l0_dgx_b300`
- GPU count: 4

Ensure the test-db contains a matching context with the correct root key and constraints; otherwise the rendered test list will be empty and the stage will no-op/fail.
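The 5-element entry can be read as follows. The first, second, and last fields are stated above; the meaning of the two middle `1`s (shard index and shard count) is inferred from the surrounding stage definitions and is an assumption, not taken from the Jenkinsfile:

```python
from typing import NamedTuple

class StageEntry(NamedTuple):
    """Hypothetical decoding of one Jenkins L0 stage-matrix entry."""
    platform: str          # pod/GPU alias requested from Kubernetes
    test_db_context: str   # root key looked up in the test-db YAML
    shard_id: int          # assumed: index of this shard of the test list
    shard_count: int       # assumed: total shards the list is split into
    gpu_count: int         # GPUs allocated to the pod

entry = StageEntry("dgx-b300-x4", "l0_dgx_b300", 1, 1, 4)
print(entry.platform, entry.gpu_count)  # dgx-b300-x4 4
```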
Run:
If you’d like, I can draft a `tests/integration/test_lists/test-db/l0_dgx_b300.yml` modeled on the existing DGX contexts (H200/H100), with `system_gpu_count: 4`, `stage: pre_merge`, and PyTorch filters.
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Context file exists
fd -a 'l0_dgx_b300.yml' tests || true
# 2) Root key matches the context name
rg -n -C2 $'^\s*l0_dgx_b300:' tests/integration/test_lists/test-db/l0_dgx_b300.yml || true
# 3) Basic constraints sanity (tune as needed)
rg -n -C1 'system_gpu_count\s*:\s*4' tests/integration/test_lists/test-db/l0_dgx_b300.yml || true
rg -n -C1 'stage\s*:\s*pre_merge' tests/integration/test_lists/test-db/l0_dgx_b300.yml || true
rg -n -C1 'backend\s*:\s*pytorch' tests/integration/test_lists/test-db/l0_dgx_b300.yml || true
Length of output: 673
Missing GPU count constraint in test-db context
The context file tests/integration/test_lists/test-db/l0_dgx_b300.yml
is present and the root key l0_dgx_b300:
is correct. The stage: pre_merge
and backend: pytorch
constraints are also defined. However, there is no system_gpu_count: 4
term, so this context won’t match the new stage and the pipeline will no-op or fail.
Please add the GPU count constraint under the terms:
section. For example:
```diff
 terms:
   stage: pre_merge
+  system_gpu_count: 4
   backend: pytorch
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```groovy
        "DGX_B300-4_GPUs-PyTorch-1": ["dgx-b300-x4", "l0_dgx_b300", 1, 1, 4],
]
```

```yaml
terms:
  stage: pre_merge
  system_gpu_count: 4
  backend: pytorch
```
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 1879-1880, the mapping references
l0_dgx_b300 but the test-db context file
tests/integration/test_lists/test-db/l0_dgx_b300.yml lacks the GPU-count
constraint; open that YAML and under the l0_dgx_b300: -> terms: section add
system_gpu_count: 4 with correct indentation so the context matches the
DGX_B300-4_GPUs-PyTorch-1 entry and the stage/backend constraints.
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.