[None][infra] Prepare for single GPU GB200 test pipeline #7073
Conversation
📝 Walkthrough

Added multi-node SLURM orchestration and per-node workspace/docker handling.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Pipeline
    participant L0 as L0_Test.groovy
    participant SLURM
    participant Node
    participant Docker
    Pipeline->>L0: runLLMTestlistOnSlurm_MultiNodes(..., nodeCount=N)
    note right of L0: generate per-node names/workspaces<br/>assemble per-node dockerArgs
    L0->>SLURM: request N nodes
    SLURM-->>L0: return node list
    loop per node
        L0->>Node: create /tmp/${nodeName}, log env/workspace
        L0->>Node: prepare dockerArgs (add NVIDIA_IMEX_CHANNELS for dlcluster)
        L0->>Docker: runInDockerOnNodeMultiStage(dockerArgs, multiStage=true)
        Docker->>Node: start container, mount per-node workspace
        Docker->>L0: execute tests, return results
        L0->>Node: upload results, dump env/workspace, cleanup (rm -rf *)
    end
    L0->>Pipeline: aggregate/upload results
```
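The per-node naming and workspace step in the diagram above can be sketched roughly as follows. This is an illustrative Python rendering only: the function name, the `job_uid` argument, and the exact `/tmp` layout are assumptions for illustration, not the pipeline's actual Groovy code.

```python
def make_node_workspaces(job_uid: str, node_count: int) -> list:
    """Generate a unique name and isolated workspace path per requested node.

    Mirrors the "generate per-node names/workspaces" step: each node gets
    its own scratch directory under /tmp, which is mounted into that node's
    container and wiped ("rm -rf *") after the run.
    """
    nodes = []
    for i in range(node_count):
        name = f"{job_uid}-node-{i}"  # hypothetical naming scheme
        nodes.append({
            "name": name,
            "workspace": f"/tmp/{name}",
        })
    return nodes
```

Per-node isolation like this avoids two containers on the same host clobbering each other's checkouts and result files.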
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-1, GB200-TensorRT-1, GB200-Triton-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1" |
PR_Github #15868 [ run ] triggered by Bot |
PR_Github #15868 [ run ] completed with state |
f887372 → dfdd2a3 (compare)
/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-1, GB200-TensorRT-1, GB200-Triton-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1" |
PR_Github #15884 [ run ] triggered by Bot |
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- jenkins/L0_Test.groovy (4 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml (0 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h200.yml (1 hunks)
- tests/integration/test_lists/test-db/l0_gb200.yml (2 hunks)
- tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml (1 hunks)
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/test-db/l0_dgx_h200.yml
tests/integration/test_lists/test-db/l0_gb200.yml
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/test_lists/test-db/l0_dgx_h200.yml
tests/integration/test_lists/test-db/l0_gb200.yml
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.
Applied to files:
tests/integration/test_lists/test-db/l0_gb200.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)

tests/integration/test_lists/test-db/l0_dgx_h200.yml (1)

169-184: LGTM: DGX H200 4-GPU Triton post-merge gate validated with smoke test

All gating conditions and the smoke-test invocation have been verified:
- The Python test exists at tests/integration/defs/triton_server/test_triton_llm.py and defines test_llmapi_backend with the expected parameters.
- The YAML block in tests/integration/test_lists/test-db/l0_dgx_h200.yml (lines 169–184) correctly targets:
  • system_gpu_count == 4
  • wildcards: gpu 'h200', linux_distribution_name 'ubuntu*', cpu 'x86_64'
  • terms: stage 'post_merge', backend 'triton'
- The test invocation triton_server/test_triton_llm.py::test_llmapi_backend[4-0-disableDecoupleMode-tensorrt_llm] matches the file and function signature.
- No waive/skip annotations are present around this block.

Approve changes.
tests/integration/test_lists/test-db/l0_gb200.yml (5)

6-7: Confirm single-GPU (1/1) gating for GB200 pre-merge PyTorch

Verified that the Jenkins L0_Test.groovy entries for GB200-PyTorch-1/2/3 (lines 1890–1892) all target ["gb200-unrestricted", "l0_gb200"], matching our single-GPU threshold. The YAML gating in tests/integration/test_lists/test-db/l0_gb200.yml (lines 6–7) is correctly set to:
- gte: 1
- lte: 1

All looks consistent — approving these changes.

106-118: Triton post-merge block for GB200: gating validated and approved

All referenced Triton tests exist under 1-GPU aarch64 gating:
- tests/integration/defs/triton_server/test_triton_llm.py
  • test_llava
  • test_llava_onevision
- tests/integration/defs/triton_server/test_triton.py
  • test_gpt_ib_ptuning
  • test_gpt_2b_ib_lora

Looks good — approved.

82-93: TRT-only block and CLI flow tests confirmed

The lte: 1 section in tests/integration/test_lists/test-db/l0_gb200.yml includes the backend: tensorrt guard (line 91) and all expected TestLlama3_8BInstruct::test_nvfp4… entries (lines 94–98).
• Please manually verify that the 1-GPU and 2-GPU splits each cover a roughly equal total test duration.
• Check both blocks in l0_gb200.yml (the gte: 1/lte: 1 and the subsequent gte: 2/lte: 2 sections) for even distribution of long-running tests.

24-41: Balanced NVFP4 split (25 tests/job) confirmed

- tests/integration/test_lists/test-db/l0_gb200.yml contains 75 entries → evenly splits into 3 batches of 25

NVFP4 coverage expansion looks solid and pre-merge load remains balanced. Continue to monitor wall-time and any long-tail tests, but no changes are required.
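The 75-into-3 split mentioned above reduces to a simple partitioning problem. As a minimal sketch (round-robin by position; the pipeline's actual splitter may instead weigh per-test duration, which this does not attempt):

```python
def split_tests(tests, batches):
    """Distribute a flat test list round-robin into the given number of batches.

    With 75 tests and 3 batches this yields three batches of 25, matching
    the balanced GB200-PyTorch-1/2/3 split described in the review.
    """
    out = [[] for _ in range(batches)]
    for i, test in enumerate(tests):
        out[i % batches].append(test)
    return out
```

Round-robin keeps batch sizes within one of each other for any list length, though it only balances counts, not wall-clock time.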
17-20: No update needed for test class references

The class TestLlama3_1_8B still defines both test_nvfp4 and test_nvfp4_streaming (lines 37–63 in tests/integration/defs/accuracy/test_llm_api_pytorch.py), so the YAML entries will resolve correctly.

Likely an incorrect or invalid review comment.
jenkins/L0_Test.groovy (3)

1839-1845: LGTM: Introduces DGX_H200 4-GPU Triton post-merge stage wired to l0_dgx_h200

The new stage key matches the test-db block you added and uses gpuCount=4, consistent with its gating.

2469-2472: LGTM: Updated phase-2 gating to include GB200 multi-GPU signs

Adding "GB200-4_GPUs" and "GB200-8_GPUs" ensures those stages run in phase 2 with DGX signs logic.

1890-1898: LGTM: GB200 stage keys verified

The regex search confirms that all expected GB200 stage entries are present and correctly named in jenkins/L0_Test.groovy (lines 1890–1897). Approving these changes.
PR_Github #15884 [ run ] completed with state |
/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-1, GB200-TensorRT-1, GB200-Triton-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1" |
Actionable comments posted: 0
♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)

266-266: Remove invalid Docker capability flag: --cap-add syslog causes container launch failures

Docker does not accept "--cap-add syslog" (CAP_SYSLOG isn't available). This has been reported to break Slurm-run container startup.

Apply:

```diff
- def dockerArgs = "--gpus ${gpuCount} --cap-add=SYS_ADMIN --ipc=host --security-opt seccomp=unconfined -u root:root -v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+ def dockerArgs = "--gpus ${gpuCount} --cap-add=SYS_ADMIN --ipc=host --security-opt seccomp=unconfined -u root:root -v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"
```

Also remove the same token in the non-Slurm path if present.
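This class of failure can be caught before a job ever reaches Docker with a small argument checker. The sketch below is purely illustrative: KNOWN_CAPS is a partial, hypothetical subset of Linux capability names, and Docker's own parser remains the authority on what it accepts.

```python
import shlex

# Illustrative subset of capability names in the uppercase form
# Docker expects (without the CAP_ prefix); not an exhaustive list.
KNOWN_CAPS = {"SYS_ADMIN", "NET_ADMIN", "SYS_PTRACE", "IPC_LOCK"}

def invalid_cap_adds(docker_args: str) -> list:
    """Return --cap-add values not matching the expected uppercase names.

    Handles both '--cap-add VALUE' and '--cap-add=VALUE' forms, so a
    lowercase stray token like 'syslog' is flagged early in CI lint
    rather than failing at container start.
    """
    tokens = shlex.split(docker_args)
    bad = []
    for i, tok in enumerate(tokens):
        if tok == "--cap-add" and i + 1 < len(tokens):
            cap = tokens[i + 1]
        elif tok.startswith("--cap-add="):
            cap = tok.split("=", 1)[1]
        else:
            continue
        if cap not in KNOWN_CAPS:
            bad.append(cap)
    return bad
```

Running such a check in a pre-commit or pipeline lint step would have flagged the duplicate "--cap-add syslog" token at both call sites in one pass.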
🧹 Nitpick comments (2)

jenkins/Build.groovy (1)

1-1: Library bump to dev-yanchaol-slurm — suggest pinning to immutable ref

Good to keep these consistent. To reduce future breakage from branch force-pushes, consider pinning to a tag or commit SHA once validated.

jenkins/L0_Test.groovy (1)

2134-2136: Bound pip sanity check to 1 hour — consider making it configurable

The timeout is reasonable, but making it a param or env (e.g., PIP_SANITY_TIMEOUT_MIN=60) improves debuggability if flakes occur.

```diff
- timeout(time: 1, unit: 'HOURS') {
+ timeout(time: env.PIP_SANITY_TIMEOUT_MIN ? env.PIP_SANITY_TIMEOUT_MIN.toInteger() : 60, unit: 'MINUTES') {
      checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
  }
```
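The env-with-default pattern suggested in that diff can be rendered in Python for clarity. PIP_SANITY_TIMEOUT_MIN is the hypothetical variable name from the suggestion above, not an existing pipeline setting:

```python
import os

def pip_sanity_timeout_minutes(env=None) -> int:
    """Read the sanity-check timeout from the environment, defaulting to 60.

    Equivalent to the Groovy ternary in the suggested diff:
    env.PIP_SANITY_TIMEOUT_MIN ? env.PIP_SANITY_TIMEOUT_MIN.toInteger() : 60
    """
    if env is None:
        env = os.environ
    raw = env.get("PIP_SANITY_TIMEOUT_MIN")
    return int(raw) if raw else 60
```

The benefit is operational: an on-call engineer can raise or lower the bound per-job without a code change when chasing flakes.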
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- jenkins/Build.groovy (1 hunks)
- jenkins/BuildDockerImage.groovy (1 hunks)
- jenkins/L0_MergeRequest.groovy (1 hunks)
- jenkins/L0_Test.groovy (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (6)
jenkins/BuildDockerImage.groovy (1)

1-1: Verify Jenkins Shared Library Branch and API Stability

- All Jenkinsfiles in jenkins/ are now consistently pinned to bloom-jenkins-shared-lib@dev-yanchaol-slurm alongside trtllm-jenkins-shared-lib@main (confirmed in jenkins/L0_MergeRequest.groovy, jenkins/BuildDockerImage.groovy, jenkins/L0_Test.groovy, jenkins/Build.groovy).
- No direct references to com.nvidia.bloom.* classes were found in the Jenkinsfile directory, suggesting jobs rely on library steps rather than explicit class imports.
- Next steps:
  • Ensure your Jenkins Global Pipeline Library configuration indexes the dev-yanchaol-slurm branch for bloom-jenkins-shared-lib.
  • Run or dry-run affected pipelines to confirm the branch's API surface matches downstream expectations, catching any missing or changed library methods.

jenkins/L0_MergeRequest.groovy (1)

1-1: Confirm pinned shared library branch presence and compatibility

The following Jenkinsfiles now reference bloom-jenkins-shared-lib@dev-yanchaol-slurm — ensure that this branch exists in your Jenkins shared library configuration and provides the required classes (e.g., KubernetesManager, CloudManager, SlurmConfig). If you introduced this pin for GB200 support, consider using a specific tag or commit SHA to avoid unintended drift.
• jenkins/L0_Test.groovy
• jenkins/BuildDockerImage.groovy
• jenkins/Build.groovy
• jenkins/L0_MergeRequest.groovy

jenkins/L0_Test.groovy (4)

2471-2474: Phase-2 gating extended with GB200 multi-GPU signs — aligns with new stages

Including "GB200-4_GPUs" and "GB200-8_GPUs" ensures multi-GPU phases are separated as intended. Looks good.

1839-1839: DGX_H200 4-GPU Triton post-merge stage verified

- Confirmed that tests/integration/test_lists/test-db/l0_dgx_h200.yml is present in this PR.
- Verified the backend: triton entry at line 181 (and the Triton tests section at line 183) in that YAML file.
- The stage tuple ["dgx-h200-x4", "l0_dgx_h200", 1, 1, 4] adheres to the [platform, yaml, split_id, split_count, gpu_count] schema and aligns with the new l0_dgx_h200 content.

LGTM.

1-1: Verify Jenkins Global Library branch configuration

All four Jenkinsfiles have been updated to reference the dev-yanchaol-slurm branch of the bloom-jenkins-shared-lib:
- jenkins/L0_Test.groovy
- jenkins/L0_MergeRequest.groovy
- jenkins/BuildDockerImage.groovy
- jenkins/Build.groovy

Please confirm on the Jenkins controller that:
- The dev-yanchaol-slurm branch is configured under Global Pipeline Libraries.
- It contains the GB200-related Slurm and Kubernetes helper functions expected by these pipelines.

No code changes are required in this PR beyond what's already mirrored; this step ensures the new branch will be resolvable at runtime.

1890-1898: Validate GB200 Slurm partitions and test-DB entries

- Tests reference two new platforms in jenkins/L0_Test.groovy:
  - Lines 1890–1897: "gb200-unrestricted" and "gb200-x4" are used as keys into SlurmConfig.partitionConfig.
  - Lines 219 and 295: these lookups occur in runLLMTestlistOnSlurm and runLLMTestlistOnSlurm_MultiNodes.
- The test-DB YAML files exist and include the correct backends:
  - tests/integration/test_lists/test-db/l0_gb200.yml covers pytorch, tensorrt, and triton.
  - tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml covers pytorch on 4 GPUs.
- The actual definitions of SlurmConfig.partitionConfig (and SlurmConfig.clusterConfig) live in the shared library, not this repo.

Please manually verify that in the shared library:
- SlurmConfig.partitionConfig includes entries for "gb200-unrestricted" and "gb200-x4".
- SlurmConfig.clusterConfig maps those partitions to valid clusters.
- The cluster definitions support the requested GPU counts and backends.
PR_Github #15986 [ run ] triggered by Bot |
PR_Github #15986 [ run ] completed with state |
/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-1, GB200-TensorRT-1, GB200-Triton-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1" --disable-fail-fast |
PR_Github #15994 [ run ] triggered by Bot |
6167b28 → f39899e (compare)
Actionable comments posted: 1
♻️ Duplicate comments (2)
jenkins/L0_Test.groovy (2)
jenkins/L0_Test.groovy (2)

1745-1745: Same invalid flag in non-Slurm dockerArgs. Remove "--cap-add syslog".

Mirror the fix at Line 266 here as well to prevent failures when running on a docker node.

```diff
- def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+ def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"
```

266-266: Fix invalid Docker capability flag: "--cap-add syslog" breaks container launch.

Docker doesn't recognize "syslog"; capabilities must match kernel names (e.g., SYS_ADMIN). This flag causes container start failures. Remove it.

Apply:

```diff
- def dockerArgs = "--gpus ${gpuCount} --cap-add=SYS_ADMIN --ipc=host --security-opt seccomp=unconfined -u root:root -v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+ def dockerArgs = "--gpus ${gpuCount} --cap-add=SYS_ADMIN --ipc=host --security-opt seccomp=unconfined -u root:root -v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"
```

Follow-up: consider centralizing dockerArgs in one helper to avoid future drift between call sites.
🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)
2134-2136: Timeout for pip-install sanity: add explicit timeout diagnostics.

The 1h timeout prevents indefinite stalls — nice. To aid triage, add a message on timeout so it's obvious in logs why this stage aborted.

```diff
- timeout(time: 1, unit: 'HOURS') {
-     checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
- }
+ try {
+     timeout(time: 1, unit: 'HOURS') {
+         checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
+     }
+ } catch (InterruptedException e) {
+     echo "checkPipInstall timed out after 1 hour for ${cpu_arch}/${wheelPath}"
+     throw e
+ }
```

2471-2474: Phase-2 gating: include GB200 signs and improve readability.

Good catch adding GB200-4/8_GPUs to the phase-2 split. Minor readability nit: avoid nested implicit it in closures.

```diff
- def dgxSigns = ["DGX_H100", "DGX_H200", "GB200-4_GPUs", "GB200-8_GPUs", "DGX_B200"]
- singleGpuJobs = parallelJobs.findAll{!dgxSigns.any{sign -> it.key.contains(sign)}}
- dgxJobs = parallelJobs.findAll{dgxSigns.any{sign -> it.key.contains(sign)}}
+ def dgxSigns = ["DGX_H100", "DGX_H200", "GB200-4_GPUs", "GB200-8_GPUs", "DGX_B200"]
+ singleGpuJobs = parallelJobs.findAll { entry -> !dgxSigns.any { s -> entry.key.contains(s) } }
+ dgxJobs = parallelJobs.findAll { entry -> dgxSigns.any { s -> entry.key.contains(s) } }
```
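The findAll split in that diff is a partition by key substring. A Python equivalent (the job names in the test are made up for illustration; the sign list comes from the diff above):

```python
DGX_SIGNS = ["DGX_H100", "DGX_H200", "GB200-4_GPUs", "GB200-8_GPUs", "DGX_B200"]

def partition_jobs(jobs: dict) -> tuple:
    """Split jobs into (single-GPU, multi-GPU) groups by stage-key substring.

    A job is multi-GPU if any DGX sign appears in its key; everything
    else runs in the single-GPU phase, mirroring the Groovy findAll pair.
    """
    multi = {k: v for k, v in jobs.items()
             if any(sign in k for sign in DGX_SIGNS)}
    single = {k: v for k, v in jobs.items() if k not in multi}
    return single, multi
```

Computing `multi` first and deriving `single` as its complement guarantees the two groups are disjoint and together cover every job, which the two independent findAll calls only guarantee by symmetry of the predicate.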
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- jenkins/Build.groovy (1 hunks)
- jenkins/BuildDockerImage.groovy (1 hunks)
- jenkins/L0_MergeRequest.groovy (1 hunks)
- jenkins/L0_Test.groovy (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- jenkins/Build.groovy
- jenkins/L0_MergeRequest.groovy
- jenkins/BuildDockerImage.groovy
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)

1-1: Pin shared library ref (avoid drifting dev branch).

Switching bloom-jenkins-shared-lib to dev-yanchaol-slurm looks intentional for SLURM work, but a moving dev branch can break the pipeline unexpectedly. Prefer a tagged release or commit SHA to guarantee reproducibility. Confirm the branch is long-lived and compatible with other Jenkinsfiles updated in this PR.

Would you like me to open a follow-up to pin this to a specific commit across all Jenkinsfiles for consistency?

1839-1845: The script above will dump the DGX H200 YAML so we can confirm that:

- The l0_dgx_h200.yml file defines the "Triton", "PyTorch", and "TensorRT" post-merge subsets.
- The system_gpu_count and ranges entries match the 1-of-1 and 1..3-of-3 splits you wired in L0_Test.groovy.

Once we validate those, we can conclude whether everything is correctly wired or if further adjustments are needed.
PR_Github #15994 [ run ] completed with state |
f39899e → a8b9716 (compare)
/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-1, GB200-TensorRT-1, GB200-Triton-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1" --disable-fail-fast |
PR_Github #16000 [ run ] triggered by Bot |
Actionable comments posted: 2
♻️ Duplicate comments (2)
jenkins/L0_Test.groovy (2)

1745-1745: Remove second occurrence of "--cap-add syslog" in dockerArgs

There's another dockerArgs with "--cap-add syslog" used by non-Slurm runs. Remove it to avoid Docker run failures there as well.

```diff
- def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+ def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"
```

266-266: Remove invalid Docker capability flag: "--cap-add syslog" breaks container launch

This was flagged previously and still appears here. Docker rejects "--cap-add syslog"; remove it to fix SLURM runner container starts.

```diff
- def dockerArgs = "--gpus ${gpuCount} --cap-add=SYS_ADMIN --ipc=host --security-opt seccomp=unconfined -u root:root -v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+ def dockerArgs = "--gpus ${gpuCount} --cap-add=SYS_ADMIN --ipc=host --security-opt seccomp=unconfined -u root:root -v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"
```
🧹 Nitpick comments (2)
tests/integration/test_lists/test-db/l0_gb200.yml (1)

24-40: Large NVFP4 DeepSeekV3Lite matrix: consider param compaction

This enumerates many orthogonal dims (moe_backend, fp8kv, attention_dp, cuda_graph, overlap_scheduler, torch_compile, mtp_nextn). If maintenance becomes painful, consider moving these to a generated include (e.g., Mako/templated expansion) to reduce churn and drift. Not blocking.

jenkins/L0_Test.groovy (1)

2135-2137: 1-hour timeout around pip-install sanity: good guardrail

This prevents hangs in the checkPipInstall step from consuming the entire job window. Keep an eye on flakes; you might reduce to 30m once stability is confirmed.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
jenkins/Build.groovy
(1 hunks)jenkins/BuildDockerImage.groovy
(1 hunks)jenkins/L0_MergeRequest.groovy
(1 hunks)jenkins/L0_Test.groovy
(6 hunks)tests/integration/test_lists/test-db/l0_dgx_h100.yml
(0 hunks)tests/integration/test_lists/test-db/l0_dgx_h200.yml
(1 hunks)tests/integration/test_lists/test-db/l0_gb200.yml
(2 hunks)tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
(1 hunks)
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
🚧 Files skipped from review as they are similar to previous changes (5)
- jenkins/L0_MergeRequest.groovy
- jenkins/BuildDockerImage.groovy
- tests/integration/test_lists/test-db/l0_dgx_h200.yml
- jenkins/Build.groovy
- tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/test-db/l0_gb200.yml
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/test_lists/test-db/l0_gb200.yml
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.
Applied to files:
tests/integration/test_lists/test-db/l0_gb200.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tests/integration/test_lists/test-db/l0_gb200.yml (5)

6-7: Correct: narrow single-GPU gating for GB200 L0 suite

Updating system_gpu_count to gte:1/lte:1 matches the "Single-GPU" intent for this YAML and avoids pulling 4-GPU tasks into this list.

17-21: Sanity on Llama3.1 NVFP4 entries

The new NVFP4 coverage for Llama3.1 8B (including two streaming intervals) is appropriate for GB200 bring-up and aligns with the rest of the NVFP4-centric matrix in this file.

59-62: Torch "unittest/_torch … -k …" notations: double-check runner parsing

These lines rely on the internal pytest wrapper recognizing raw "path -k expr" forms (not test nodeids). This is consistent with prior lists, but please confirm the test-db renderer and the defs runner still accept these exact forms after recent refactors.

Would you like a quick repo scan script that checks each listed path exists and flags missing files/classes?

Also applies to: 70-78

90-105: TRT pre-merge block looks good

- backend: tensorrt
- stage: pre_merge
- CLI-flow and unit tests chosen are representative for engine build and NVFP4/GEMM-plugin coverage.

119-122: Good Triton smoke set for GB200

The three Triton functional tests are reasonable post-merge coverage on single GPU. With the backend/stage fix above, they should render properly.

jenkins/L0_Test.groovy (3)

1-1: Pin shared library to a commit/tag to avoid pipeline drift

Moving bloom-jenkins-shared-lib to dev-yanchaol-slurm is fine for experimentation, but dev branches can change underfoot and break reproducibility. Recommend pinning to a stable tag or commit SHA.

Example:

```diff
- @Library(['bloom-jenkins-shared-lib@dev-yanchaol-slurm', 'trtllm-jenkins-shared-lib@main']) _
+ @Library(['bloom-jenkins-shared-lib@dev-yanchaol-slurm/<commit-sha>', 'trtllm-jenkins-shared-lib@main']) _
```

Would you like a quick check that all Jenkinsfiles in this PR consistently use the same lib and ref?

1841-1841: Add DGX_H200-4_GPUs Triton post-merge: good

Matches the broader matrix shift toward H200 4-GPU Triton post-merge. No concerns.

2472-2474: Phase-2 gating list updated to include GB200 multi-GPU: good

The dgxSigns now capture GB200 4- and 8-GPU groups so the phase-2 stage properly separates single- and multi-GPU runs.
a8b9716 → edf9081 (compare)
/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-1, GB200-TensorRT-1, GB200-Triton-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1" --disable-fail-fast |
Actionable comments posted: 0
♻️ Duplicate comments (2)
jenkins/L0_Test.groovy (2)

1784-1784: Remove the other "--cap-add syslog" in dockerArgs (dockerNode path).

There's still an occurrence used when running sanity on a static docker node. It will similarly break container startup.

Apply:

```diff
- def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+ def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"
```

285-293: Remove invalid Docker capability "--cap-add syslog" (breaks container start).

Docker does not accept "syslog" as a capability. This causes container launch failures in Slurm runs.

Apply:

```diff
  def dockerArgs = "--gpus ${gpuCount} " +
      "--cap-add=SYS_ADMIN " +
      "--ipc=host " +
      "--security-opt seccomp=unconfined " +
      "-u root:root " +
      "-v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro " +
      "-v /tmp/ccache:${CCACHE_DIR}:rw " +
      "-v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw " +
-     "--cap-add syslog"
+     ""
```

Note: There's another occurrence in launchTestJobs' dockerArgs used for dockerNode runs; please remove it as well (see comment below).
🧹 Nitpick comments (2)
tests/integration/test_lists/test-db/l0_gb200.yml (1)

17-20: Pre-merge PyTorch grid validated — tests exist and count ~87

- All three Llama3.1 NVFP4 tests resolve to existing entries in l0_gb200.yml (lines 18–20) and companion lists (e.g. l0_b200.yml, qa/llm_function_sanity.txt).
- DeepSeek V3 Lite NVFP4 variants, Qwen3 NVFP4 permutations, GPT-OSS W4 tests, and the E2E cache-router eviction test all map to concrete entries in the test-db.
- Approximate count of pre-merge PyTorch tests in this block: 87

Optional refactor recommendation (if CI runtime is a concern):
- Move one of the two Llama3.1 streaming intervals (stream_interval_4 or stream_interval_64) to post-merge.
- Gate a subset of DeepSeek V3 "chunked_prefill" permutations behind a post-merge flag.

Also applies to lines 24–72.

jenkins/L0_Test.groovy (1)

274-284: Debug dumps are helpful; minor nit on env vars.

The env dumps aid triage. If these lines ever move under withEnv, prefer $WORKSPACE/$WORKSPACE_TMP to reflect overridden values at runtime.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- jenkins/L0_Test.groovy (11 hunks)
- tests/integration/test_lists/test-db/l0_gb200.yml (3 hunks)
- tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/test-db/l0_gb200.yml
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/test_lists/test-db/l0_gb200.yml
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.
Applied to files:
tests/integration/test_lists/test-db/l0_gb200.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (13)
tests/integration/test_lists/test-db/l0_gb200.yml (4)

Lines 6-7: Single-GPU gating set correctly (gte: 1, lte: 1). Matches the PR goal of migrating single-GPU GB200 tests and aligns with the Jenkins stage definitions for GB200 single-GPU runs.

Lines 76-85: Post-merge TensorRT block looks consistent with CLI flow and TRT unit tests.
- Backend set to tensorrt with stage post_merge and single-GPU gating on GB200 aarch64 — consistent with the Jenkins "GB200-TensorRT-Post-Merge-{1,2}" stages.
- Mix of CLI flow and TRT unit tests covers both builder and runtime paths.
Also applies to lines 87-99.

Lines 100-111: Triton block now correctly uses backend=triton and post_merge. This fixes the earlier backend/stage mismatch that would have yielded an empty list for "GB200-Triton-Post-Merge-1". Good catch. Also applies to lines 113-116.

Lines 118-121: Post-merge PyTorch additions align with NVFP4 matrices. Reasonable breadth across CUTLASS/TRTLLM, attention_dp, and fp8kv toggles for single-GPU GB200. No issues spotted. Also applies to lines 131-144.
jenkins/L0_Test.groovy (9)

Line 1: Pinning the shared library to a dev branch — verify availability and stability. Using bloom-jenkins-shared-lib@dev-yanchaol-slurm is fine for experimentation, but please confirm this ref exists on the Jenkins controller and won't be GC'd or force-pushed. If this is long-lived, consider pinning to a tag or commit SHA to avoid accidental breakages from branch moves.

Lines 214-219: Workspace cleanup after result packaging looks safe. This executes after cacheErrorAndUploadResult has archived results. No objections.

Lines 228-233: Per-node suffix and custom workspace improve isolation. Unique node names and /tmp workspaces reduce cross-job interference on the Slurm front end. LGTM.

Line 299: Multi-stage Docker runner usage is correct here. Passing true to delete the workspace before the run avoids leftover artifacts when reusing nodes.

Line 827: Doc: step label updated; no functional impact. Comment tweak only. No action needed.

Lines 1272-1282: Early workspace cleanup before test setup is fine; ensure no required inputs reside in the CWD. Given sources and artifacts are fetched later into ${LLM_ROOT}${config}, the rm -rf here is safe.

Lines 1879-1884: New DGX H200 post-merge stages — consistent split and GPU gating. Adds Triton and TensorRT splits with 4-GPU gating. Matches the recent matrix expansion; no issues spotted.

Lines 2173-2174: Adding a 1-hour timeout around checkPipInstall is a good guardrail. Avoids hangs in the pip sanity check. Looks good.
Lines 1932-1941: Verification required: Slurm partitions and test-DB entry.

I couldn't locate a `SlurmConfig` class or a `partitionConfig` assignment in this repository; these are likely defined in your shared pipeline library. Please confirm that the partitions "gb200-unrestricted" and "gb200-x4" are indeed defined in `SlurmConfig.partitionConfig`.

Confirmed presence of the test-DB file: `tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml` exists on this branch.
PR_Github #16278 [ run ] completed with state |
Force-pushed from 426d490 to 9de4407.
/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-Post-Merge-1, GB200-TensorRT-Post-Merge-2, GB200-Triton-Post-Merge-1, GB200-PyTorch-Post-Merge-1" --disable-fail-fast |
PR_Github #16281 [ run ] triggered by Bot |
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/integration/test_lists/test-db/l0_gb200.yml (1)

Lines 100-116: Triton backend/stage now correct and aligned with "GB200-Triton-Post-Merge-1". The previous backend/stage mismatch is resolved (backend=triton, stage=post_merge). This prevents the empty-selection issue noted earlier.
🧹 Nitpick comments (3)
tests/integration/test_lists/test-db/l0_gb200.yml (3)

Lines 87-99: TRT test selection provides engine, quant, and GEMM coverage; consider one GB200-specific plugin smoke test if available. The current set is solid. If there is a lightweight GB200-specific kernel/plugin sanity check (e.g., FP4 GEMM on the GB200 path, or the KV-cache plugin), adding one micro-test can shorten the failure signal on regressions.

Lines 76-144: DRY optional: factor repeated condition clauses via YAML anchors (future maintenance). The three single-GPU GB200 Grace/Ubuntu condition blocks repeat the same wildcards/ranges. Using YAML anchors/aliases would reduce duplication and the chance of drift, but it's optional if the team prefers explicit blocks.

Example:

```diff
-l0_gb200:
-- condition:
-    ranges:
-      system_gpu_count:
-        gte: 1
-        lte: 1
-    wildcards:
-      gpu:
-      - '*gb200*'
-      linux_distribution_name: ubuntu*
-      cpu: aarch64
+common_single_gb200: &common_single_gb200
+  ranges:
+    system_gpu_count: { gte: 1, lte: 1 }
+  wildcards:
+    gpu: ['*gb200*']
+    linux_distribution_name: ubuntu*
+    cpu: aarch64
+
+l0_gb200:
+- condition:
+    <<: *common_single_gb200
   terms:
     stage: pre_merge
     backend: pytorch
   tests: …
-  condition:
-    ranges:
-      system_gpu_count:
-        gte: 1
-        lte: 1
-    wildcards:
-      gpu:
-      - '*gb200*'
-      linux_distribution_name: ubuntu*
-      cpu: aarch64
+  condition:
+    <<: *common_single_gb200
   terms:
     stage: post_merge
     backend: tensorrt
   tests: …
```
Lines 17-144: Balance GB200 shards in l0_gb200.yml to reduce CI flakiness. The shard counts are currently very uneven, which can lead to timeouts and instability:

- Block 1: 57 test selectors
- Block 2: 13 test selectors
- Block 3: 5 test selectors
- Block 4: 14 test selectors

Please redistribute tests more evenly — e.g., split the largest block or move some heavy tests into under-utilized shards — to help stabilize CI runtime.

File: tests/integration/test_lists/test-db/l0_gb200.yml
📒 Files selected for processing (3)
- jenkins/L0_Test.groovy (11 hunks)
- tests/integration/test_lists/test-db/l0_gb200.yml (3 hunks)
- tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
- jenkins/L0_Test.groovy
🔇 Additional comments (5)
tests/integration/test_lists/test-db/l0_gb200.yml (5)

Lines 51-57: Nice inclusion of e2e quickstarts for single-GPU GB200. The curated PTP e2e tests cover NVFP4, FP8, MTP, EAGLE3, and n-gram. This gives good breadth without invoking multi-GPU paths.

Lines 76-86: Post-merge TRT condition is well scoped for GB200 single-GPU on Grace/Ubuntu. The stage/backend/wildcards combination matches the intended GB200 TRT single-GPU runner profile.

Lines 6-7: Multi-GPU flag scan completed; no issues detected.
- The gating file tests/integration/test_lists/test-db/l0_gb200.yml lists only single-GPU parameters (`gte: 1`, `lte: 1`) and contains no "4gpu", "multi-GPU", or TP/PP/EP > 1 hints.
- A full-text sweep of the tests directory for tensor_parallel, pipeline_parallel, expert_parallel, tp, pp, or ep assignments > 1 yielded no matches.
Based on these results, the single-GPU migration gating is correctly implemented and there are no unintended multi-GPU dependencies.

Lines 17-49: No additional baseline needed for TestLlama3_1_8B. The `TestLlama3_1_8B` class already defines both a non-quantized baseline (`test_bfloat16`) and an FP8 baseline (`test_fp8`), so the suggestion to add one of these tests is redundant. No further changes are required. Likely an incorrect or invalid review comment.

Lines 118-144: No additional test needed — torch_compile=True for TRTLLM is already covered. The `test_nvfp4` method is decorated with:

```python
@parametrize_with_ids("fp8kv,attention_dp,cuda_graph,overlap_scheduler", [(False, False, True, False), …])
@parametrize_with_ids("torch_compile", [False, True])
@parametrize_with_ids("mtp_nextn", [0, 2])
@parametrize_with_ids("moe_backend", ["CUTLASS", "TRTLLM"])
```

This generates the Cartesian product, including the combination (fp8kv=False, attention_dp=False, cuda_graph=True, overlap_scheduler=False, torch_compile=True, mtp_nextn=0, moe_backend=TRTLLM). No change is necessary. Likely an incorrect or invalid review comment.
PR_Github #16281 [ run ] completed with state |
Signed-off-by: Yanchao Lu <[email protected]>
Force-pushed from 9de4407 to 8c29407.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
jenkins/scripts/slurm_run.sh (2)

Lines 69-76: coverageConfigFile is undefined here — use the actual file path you created. You create $jobWorkspace/.coveragerc above, but later refer to $coverageConfigFile, which is never set, causing sed and pytest-cov to misbehave.

```diff
-sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$coverageConfigFile"
+sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$jobWorkspace/.coveragerc"
@@
-    "--cov-config=$coverageConfigFile"
+    "--cov-config=$jobWorkspace/.coveragerc"
```

Lines 66-68: Don't mutate paths with sed; trim whitespace instead. The pattern s/[[:space:]]+/_/g is incorrect in sed BRE (the + is literal) and, where it did match, would replace whitespace with underscores, corrupting paths. Just strip the whitespace.

```diff
-trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
+trtllmWhlPath=$(echo "$trtllmWhlPath" | tr -d '[:space:]')
@@
-containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g')
+containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | tr -d '[:space:]')
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
+containerLDLibPath=$(echo "$containerLDLibPath" | tr -d '[:space:]')
```

Also applies to lines 77-80.
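The BRE quirk above is easy to demonstrate: in POSIX basic regular expressions `+` is an ordinary character, so the pattern means "a whitespace character followed by a literal plus sign". A minimal sketch (the sample path is illustrative, not taken from the pipeline):

```shell
# Hypothetical wheel path with an embedded and a trailing space
p='/tmp/whl dir/file.whl '

# BRE sed: '+' is literal, so '[[:space:]]+' matches "<whitespace>+" and changes nothing here
echo "sed: [$(printf '%s' "$p" | sed 's/[[:space:]]+/_/g')]"

# tr deletes every whitespace byte, including the trailing one
echo "tr : [$(printf '%s' "$p" | tr -d '[:space:]')]"
```

The first line prints the path unchanged (spaces intact inside the brackets); the second prints it with all whitespace removed, which is the behavior the fix above relies on.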
♻️ Duplicate comments (2)
tests/integration/test_lists/test-db/l0_gb200.yml (1)

Line 52: The NV-bugged pre-merge test will block CI — mark it xfail or move it to post-merge. disaggregated/test_workers.py::test_workers_kv_cache_aware_router_eviction is annotated with "# nvbugs 5300551" but not xfail/skip. It will fail pre-merge.

Options:
- Add xfail in the test file with reason "nvbugs 5300551".
- Or remove this selector from pre_merge here and include it only in a post_merge block.

jenkins/L0_Test.groovy (1)

Lines 284-292: Remove the invalid Docker flag: "--cap-add syslog" breaks container start. Docker doesn't accept "--cap-add syslog" (kernel caps are uppercase; CAP_SYSLOG isn't generally available). This will fail docker.image(...).inside().

```diff
 def dockerArgs = "--gpus ${gpuCount} " +
     "--cap-add=SYS_ADMIN " +
     "--ipc=host " +
     "--security-opt seccomp=unconfined " +
     "-u root:root " +
     "-v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro " +
     "-v /tmp/ccache:${CCACHE_DIR}:rw " +
-    "-v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw " +
-    "--cap-add syslog"
+    "-v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw "
```

Note: there's another occurrence later in launchTestJobs (around line 1789). Please remove that too.
🧹 Nitpick comments (3)
jenkins/scripts/slurm_run.sh (1)

Lines 89-92: Avoid eval; execute the array directly. eval risks word-splitting and injection. You already built an argv array; invoke it as-is.

```diff
-fullCmd="${testCmdLines[*]}"
-echo "Running: $testCase"
-echo "Full Command: $fullCmd"
-eval $fullCmd
+echo "Running: $testCase"
+printf 'Full Command (argv): %q ' "${testCmdLines[@]}"; echo
+"${testCmdLines[@]}"
```

tests/integration/test_lists/test-db/l0_gb200.yml (1)

Lines 1-2: The header comment is misleading. It says "not used in the pipeline now," but this YAML defines concrete GB200 single-GPU suites. Either wire it up in Jenkins or update the comment to reflect current usage to avoid confusion.

Would you like me to wire this into a Jenkins stage (e.g., GB200-Single-GPU-PyTorch-1) or remove the header note?
jenkins/L0_Test.groovy (1)
Lines 214-219: Duplicate workspace cleanup — consolidate in one place. You added a "Clean up the workspace" step here and also inside cacheErrorAndUploadResult (lines 591–596). Two rm -rf passes are redundant and can complicate debugging. Consider removing this copy and keeping cleanup centralized inside cacheErrorAndUploadResult.
📒 Files selected for processing (6)
- jenkins/L0_Test.groovy (11 hunks)
- jenkins/scripts/slurm_run.sh (2 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml (0 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h200.yml (1 hunk)
- tests/integration/test_lists/test-db/l0_gb200.yml (3 hunks)
- tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml (1 hunk)
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/test_lists/test-db/l0_dgx_h200.yml
- tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh
[warning] 37-37: pytestTestTimeout is referenced but not assigned.
(SC2154)
🔇 Additional comments (7)
jenkins/scripts/slurm_run.sh (1)

Lines 36-38: Provide a default for pytestTestTimeout (SC2154) and export safely. pytestTestTimeout may be unset depending on the caller. Set a sane default before exporting and quote the export.

```diff
+pytestTestTimeout="${pytestTestTimeout:-7200}"
-export CPP_TEST_TIMEOUT_OVERRIDDEN=$pytestTestTimeout
+export CPP_TEST_TIMEOUT_OVERRIDDEN="${pytestTestTimeout}"
```

If this script is ever run outside the multi-node launcher, this guards against spurious failures.
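The `${var:-default}` expansion used in the suggested fallback behaves as follows (the 7200 default mirrors the suggestion above; the variable name is the one from the script):

```shell
# Unset: the expansion yields the default without assigning the variable
unset pytestTestTimeout
echo "${pytestTestTimeout:-7200}"   # falls back to 7200

# Set (e.g. by the caller's environment): the real value wins
pytestTestTimeout=3600
echo "${pytestTestTimeout:-7200}"   # prints 3600

# With the colon, an empty string also triggers the fallback
pytestTestTimeout=""
echo "${pytestTestTimeout:-7200}"   # falls back to 7200
```

Using `${pytestTestTimeout:=7200}` instead would additionally assign the default to the variable, which matters here only if later lines reference it unquoted.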
tests/integration/test_lists/test-db/l0_gb200.yml (2)
Lines 8-9: LGTM: single-GPU gating updated. system_gpu_count moved to gte: 1/lte: 1 for GB200 single-GPU. Matches the intent of the single-GPU stages.

Lines 102-118: LGTM: the Triton post-merge block now correctly uses backend: triton. This fixes the earlier backend/stage mismatch; "triton_server/test_triton.py" will now render for the GB200-Triton-Post-Merge stages.
jenkins/L0_Test.groovy (4)
jenkins/L0_Test.groovy (4)

Lines 228-233: Per-node name/workspace looks good. A unique node name and custom /tmp workspace reduce collisions on shared clusters.

Line 356: Timeout plumbed for MultiNodes. pytestTestTimeout is explicitly set to 7200 seconds and exported into slurm_launch.sh. Aligns with the slurm_run.sh changes.

Lines 2172-2173: LGTM: bound pip-install sanity to 1 hour. Bounding checkPipInstall with a timeout avoids indefinitely hung builds.

Lines 2508-2511: LGTM: clearer multi-GPU phase separation. Using ["2_GPUs", "4_GPUs", "8_GPUs"] to split phase 2 makes stage grouping explicit and matches the naming.
Signed-off-by: Yanchao Lu <[email protected]>
Force-pushed from 8c29407 to 2c2fab7.
/bot run --stage-list "GB200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1" |
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
jenkins/scripts/slurm_run.sh (1)

Lines 5-14: Undefined coverageConfigFile: the sed substitution and --cov-config will fail. The .coveragerc is created at $jobWorkspace/.coveragerc, but later lines reference $coverageConfigFile, which is never set. sed -i … "$coverageConfigFile" (line 69) and --cov-config=$coverageConfigFile (line 75) will fail.

Apply this diff to define and consistently use coverageConfigFile:

```diff
-# generate .coveragerc in workspace
-cat << EOF > $jobWorkspace/.coveragerc
+# generate .coveragerc in workspace
+coverageConfigFile="$jobWorkspace/.coveragerc"
+cat << EOF > "$coverageConfigFile"
 [run]
 branch = True
 data_file = $jobWorkspace/.coverage.$stageName
 [paths]
 source =
     $llmSrcNode/tensorrt_llm/
     ---wheel_path---/tensorrt_llm/
 EOF
```

jenkins/L0_Test.groovy (1)

Line 1775: Same issue elsewhere: drop "--cap-add syslog" from dockerArgs. This occurrence will fail non-Slurm docker runs as well.

Apply:

```diff
-    def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+    def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"
```
♻️ Duplicate comments (4)
jenkins/scripts/slurm_run.sh (1)

Line 46: Ack: the trailing comma for the pytest timeout-method is fixed. The prior "--timeout-method=thread," issue is resolved; the option is now valid.

tests/integration/test_lists/test-db/l0_gb200.yml (2)

Line 52: The NV-bugged test in pre-merge will gatekeep CI; move it or mark it xfail. disaggregated/test_workers.py::test_workers_kv_cache_aware_router_eviction is tagged with "nvbugs 5300551" and has no xfail/skip in source. Keeping it in pre_merge will likely fail GB200 single-GPU gating.

Suggested adjustment in this YAML (remove from pre_merge):

```diff
-  - disaggregated/test_workers.py::test_workers_kv_cache_aware_router_eviction[TinyLlama-1.1B-Chat-v1.0] # nvbugs 5300551
```

Alternatively, mark the test with @pytest.mark.xfail(reason="nvbugs 5300551") in tests/integration/defs/disaggregated/test_workers.py so it can remain in pre_merge without blocking.

Lines 112-118: Good fix: the Triton block now matches backend=triton and stage=post_merge. This resolves the earlier backend/stage mismatch that produced empty stage renders.

jenkins/L0_Test.groovy (1)

Lines 277-286: Remove the invalid Docker capability: "--cap-add syslog" breaks container launch. Docker does not support a "syslog" capability, and "--cap-add syslog" causes run failures. It also appears as a stray token in the args string.

Apply:

```diff
 def dockerArgs = "--gpus ${gpuCount} " +
     "--cap-add=SYS_ADMIN " +
     "--ipc=host " +
     "--security-opt seccomp=unconfined " +
     "-u root:root " +
     "-v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro " +
     "-v /tmp/ccache:${CCACHE_DIR}:rw " +
-    "-v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw " +
-    "--cap-add syslog"
+    "-v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw "
```
🧹 Nitpick comments (5)
jenkins/scripts/slurm_run.sh (4)

Line 37: Guard CPP_TEST_TIMEOUT_OVERRIDDEN with a sane default. pytestTestTimeout is passed from the launcher, but shellcheck warns it may be unset (SC2154). Provide a fallback to avoid exporting an empty value.

Apply:

```diff
-export CPP_TEST_TIMEOUT_OVERRIDDEN=$pytestTestTimeout
+export CPP_TEST_TIMEOUT_OVERRIDDEN="${pytestTestTimeout:-3600}"
```

Line 67: The sed pattern uses a literal "+", so the whitespace trimming doesn't work. sed without -E treats "+" literally, so 's/[[:space:]]+/_/g' won't collapse whitespace. Use tr to strip whitespace/newlines.

Apply:

```diff
-trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
+trtllmWhlPath=$(echo "$trtllmWhlPath" | tr -d '[:space:]')
-containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g')
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
+containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | tr -d '[:space:]')
+containerLDLibPath=$(echo "$containerLDLibPath" | tr -d '[:space:]')
```

Also applies to lines 78-80.

Line 23: Run apt-get update before install to reduce transient failures. The install may fail on fresh containers without an updated package index.

Apply:

```diff
-    apt-get install -y libffi-dev
+    apt-get update -y && apt-get install -y libffi-dev
```

Line 29: Typo in the debug log: a doubled equals sign in GPU_UUIDS. Minor nit: the output currently prints "GPU_UUIDS = =…".

Apply:

```diff
-    echo "HOST_NODE_NAME = $HOST_NODE_NAME ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
+    echo "HOST_NODE_NAME = $HOST_NODE_NAME ; GPU_UUIDS = $gpuUuids ; STAGE_NAME = $stageName"
```
tests/integration/test_lists/test-db/l0_gb200.yml (1)

Lines 1-2: The header comment contradicts usage; please clarify. The header says "Don't add any tests here" and "not used in the pipeline now," but this file now defines active pre_merge and post_merge blocks. Please update or remove the misleading comments to prevent accidental edits elsewhere.
📒 Files selected for processing (4)
- jenkins/L0_Test.groovy (9 hunks)
- jenkins/scripts/slurm_run.sh (2 hunks)
- tests/integration/test_lists/test-db/l0_gb200.yml (3 hunks)
- tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)

Lines 2157-2159: Nice: checkPipInstall is bounded with a 1-hour timeout. A good defensive guard to prevent long-running artifact queries from stalling the whole pipeline.

Lines 1923-1927: Verify the GB200 single-GPU migration wiring (stage references vs. YAMLs). SBSASlurmTestConfigs comments out "GB200-PyTorch-1" and only wires multi-GPU (l0_gb200_multi_gpus). Given the PR objective is "Migrate B200 single GPU tests to GB200," please confirm the intended single-GPU GB200 pre/post stages are added somewhere (or update the comment). Otherwise, the new l0_gb200 single-GPU content won't be exercised.
PR_Github #16302 [ run ] triggered by Bot |
PR_Github #16302 [ run ] completed with state |
/bot skip --comment "Partial testing is sufficient" |
PR_Github #16309 [ skip ] triggered by Bot |
PR_Github #16309 [ skip ] completed with state |
Summary by CodeRabbit: New Features, Tests, Chores, Documentation (details collapsed).