
Conversation

@chzblych (Collaborator) commented Aug 20, 2025

Summary by CodeRabbit

  • New Features

    • Multi-node Slurm test execution with per-node workspaces, enhanced per-node logging/diagnostics, multi-stage Docker runner, and a new public multi-node test entry.
  • Tests

    • Added DGX H200 (4× GPU) Triton post-merge test; removed DGX H100 (4× GPU) Triton post-merge test.
    • Major GB200 test matrix overhaul: NVFP4 substitutions, expanded 1‑GPU/4‑GPU gating, new GB200 multi‑GPU YAML, and pruning of legacy entries.
  • Chores

    • CI reliability improvements: pip install timeout, configurable test timeout with thread-based method, improved workspace cleanup and logging.
  • Documentation

    • Minor build step comment wording updated.

@chzblych chzblych requested review from a team as code owners August 20, 2025 05:42
@chzblych chzblych requested review from zeroepoch and kxdc August 20, 2025 05:42
@coderabbitai bot (Contributor) commented Aug 20, 2025

📝 Walkthrough

Added multi-node SLURM orchestration and per-node workspace/docker handling in jenkins/L0_Test.groovy; reorganized GB200/DGX test matrices (added H200, new GB200 multi‑GPU matrix, NVFP4/FP8 substitutions), removed an H100 Triton gate, and made SLURM/pytest timeout and docker arg tweaks.

Changes

Cohort / File(s) — change summary:

  • Jenkins pipeline (jenkins/L0_Test.groovy): Added runLLMTestlistOnSlurm_MultiNodes; per-node unique suffixes and custom workspaces; per-node env/workspace logging and dumps; multi-line dockerArgs with extra logging; append NVIDIA_IMEX_CHANNELS=0 for dlcluster; enable the multi-stage Docker runner; workspace cleanup after upload; wrap the pip install check in a 1-hour timeout; remove a duplicate import; small comment/whitespace tweaks.
  • SLURM runner script (jenkins/scripts/slurm_run.sh): Make CPP_TEST_TIMEOUT_OVERRIDDEN use $pytestTestTimeout; add --timeout-method=thread to the pytest invocation.
  • DGX H100 test list (tests/integration/test_lists/test-db/l0_dgx_h100.yml): Removed the post_merge Triton gating block targeting the 4× H100 Ubuntu Triton test.
  • DGX H200 test list (tests/integration/test_lists/test-db/l0_dgx_h200.yml): Added a new post_merge Triton block for 4× H200 Ubuntu x86_64 running a Triton LLM test.
  • GB200 single-node matrix (tests/integration/test_lists/test-db/l0_gb200.yml): Lowered pre_merge system_gpu_count gating (4→1); replaced many 4-GPU PyTorch entries with NVFP4/FP8 variants; reorganized post_merge into separate TensorRT, Triton, and NVFP4 PyTorch blocks with updated OS/CPU constraints.
  • GB200 multi-GPU matrix, new (tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml): New file defining a 4-GPU GB200 Ubuntu aarch64 matrix (pre_merge and post_merge) covering multi-GPU NVFP4/FP8 PyTorch tests and latency variants.
  • Test matrix adjustments (tests/integration/test_lists/test-db/*): Removed DGX_H100-4_GPUs-Triton-Post-Merge-1; added DGX_H200-4_GPUs-Triton-Post-Merge-1 and GB200 multi-GPU entries; adjusted phase/grouping signs; removed the obsolete GB200 8-GPU 2-node entry.
  • Comments (jenkins/L0_Test.groovy): Step 1 comment updated from "tekit" to "source code".
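The slurm_run.sh change summarized above can be sketched as follows. The $pytestTestTimeout variable name comes from the summary; the surrounding script is an illustrative assumption, not the actual runner:

```shell
# Illustrative sketch: the C++ test timeout now follows the configurable
# pytest timeout, and pytest-timeout uses a watchdog thread instead of
# SIGALRM (safer when tests spawn subprocesses or worker threads).
pytestTestTimeout="${pytestTestTimeout:-3600}"
export CPP_TEST_TIMEOUT_OVERRIDDEN="$pytestTestTimeout"

# Assemble the pytest invocation; echoed here instead of executed.
pytest_cmd="pytest --timeout=$pytestTestTimeout --timeout-method=thread"
echo "$pytest_cmd"
```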

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Pipeline
    participant L0 as L0_Test.groovy
    participant SLURM
    participant Node
    participant Docker
    Pipeline->>L0: runLLMTestlistOnSlurm_MultiNodes(..., nodeCount=N)
    note right of L0 #E8F6FF: generate per-node names/workspaces\nassemble per-node dockerArgs
    L0->>SLURM: request N nodes
    SLURM-->>L0: return node list
    loop per node
        L0->>Node: create `/tmp/${nodeName}`, log env/workspace
        L0->>Node: prepare dockerArgs (add NVIDIA_IMEX_CHANNELS for dlcluster)
        L0->>Docker: runInDockerOnNodeMultiStage(dockerArgs, multiStage=true)
        Docker->>Node: start container, mount per-node workspace
        Docker->>L0: execute tests, return results
        L0->>Node: upload results, dump env/workspace, cleanup (`rm -rf *`)
    end
    L0->>Pipeline: aggregate/upload results
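The per-node lifecycle in the diagram can be sketched in shell. The node names, the dlcluster check, and the mount paths are illustrative assumptions, not the pipeline's actual variables:

```shell
# Illustrative per-node workspace lifecycle: create, log, run, clean up.
cluster="dlcluster"
for nodeName in demo-node-0 demo-node-1; do
    ws="/tmp/${nodeName}"            # per-node workspace
    mkdir -p "$ws"
    env > "$ws/env.dump"             # per-node env dump for diagnostics

    # Assemble per-node docker args; IMEX channel is appended only on dlcluster.
    dockerArgs="-v ${ws}:${ws}:rw"
    if [ "$cluster" = "dlcluster" ]; then
        dockerArgs="$dockerArgs -e NVIDIA_IMEX_CHANNELS=0"
    fi
    echo "node=$nodeName dockerArgs=$dockerArgs"
done

# After results are uploaded, the per-node workspaces are removed.
for nodeName in demo-node-0 demo-node-1; do
    rm -rf "/tmp/${nodeName}"
done
```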

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • kxdc
  • yuanjingx87
  • litaotju
  • Shixiaowei02


@chzblych (Collaborator, Author):

/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-1, GB200-TensorRT-1, GB200-Triton-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1"

@tensorrt-cicd (Collaborator):

PR_Github #15868 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #15868 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11927 (Partly Tested) completed with status: 'FAILURE'

@chzblych chzblych force-pushed the migrate-b200-to-gb200 branch from f887372 to dfdd2a3 Compare August 20, 2025 06:53
@chzblych (Collaborator, Author):

/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-1, GB200-TensorRT-1, GB200-Triton-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1"

@tensorrt-cicd (Collaborator):

PR_Github #15884 [ run ] triggered by Bot

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f887372 and dfdd2a3.

📒 Files selected for processing (5)
  • jenkins/L0_Test.groovy (4 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml (0 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h200.yml (1 hunks)
  • tests/integration/test_lists/test-db/l0_gb200.yml (2 hunks)
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml (1 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_h200.yml
  • tests/integration/test_lists/test-db/l0_gb200.yml
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_h200.yml
  • tests/integration/test_lists/test-db/l0_gb200.yml
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • tests/integration/test_lists/test-db/l0_gb200.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tests/integration/test_lists/test-db/l0_dgx_h200.yml (1)

169-184: LGTM: DGX H200 4-GPU Triton post-merge gate validated with smoke test

All gating conditions and the smoke-test invocation have been verified:

  • The Python test exists at tests/integration/defs/triton_server/test_triton_llm.py and defines test_llmapi_backend with the expected parameters.
  • The YAML block in tests/integration/test_lists/test-db/l0_dgx_h200.yml (lines 169–184) correctly targets:
    • system_gpu_count == 4
    • wildcards: gpu ‘h200’, linux_distribution_name ‘ubuntu*’, cpu ‘x86_64’
    • terms: stage ‘post_merge’, backend ‘triton’
  • The test invocation triton_server/test_triton_llm.py::test_llmapi_backend[4-0-disableDecoupleMode-tensorrt_llm] matches the file and function signature.
  • No waive/skip annotations are present around this block.

Approve changes.
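The gating conditions listed above follow the test-db YAML shape sketched below. This is a hedged reconstruction from the review's description; exact key ordering and indentation may differ from the real l0_dgx_h200.yml:

```yaml
# Hypothetical sketch of the 4x H200 post_merge Triton gating block;
# values mirror the conditions the review verified, layout is illustrative.
- condition:
    ranges:
      system_gpu_count:
        gte: 4
        lte: 4
    wildcards:
      gpu:
      - '*h200*'
      linux_distribution_name: ubuntu*
      cpu: x86_64
    terms:
      stage: post_merge
      backend: triton
  tests:
  - triton_server/test_triton_llm.py::test_llmapi_backend[4-0-disableDecoupleMode-tensorrt_llm]
```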

tests/integration/test_lists/test-db/l0_gb200.yml (5)

6-7: Confirm single-GPU (1/1) gating for GB200 pre-merge PyTorch

Verified that the Jenkins L0_Test.groovy entries for GB200-PyTorch-1/2/3 (lines 1890–1892) all target ["gb200-unrestricted", "l0_gb200"], matching our single-GPU threshold. The YAML gating in tests/integration/test_lists/test-db/l0_gb200.yml (lines 6–7) is correctly set to:

  • gte: 1
  • lte: 1

All looks consistent—approving these changes.


106-118: Triton post-merge block for GB200: gating validated and approved

All referenced Triton tests exist under 1-GPU aarch64 gating:

  • tests/integration/defs/triton_server/test_triton_llm.py
    • test_llava
    • test_llava_onevision
  • tests/integration/defs/triton_server/test_triton.py
    • test_gpt_ib_ptuning
    • test_gpt_2b_ib_lora

Looks good—approved.


82-93: TRT-only block and CLI flow tests confirmed
The lte: 1 section in tests/integration/test_lists/test-db/l0_gb200.yml includes the backend: tensorrt guard (line 91) and all expected TestLlama3_8BInstruct::test_nvfp4… entries (lines 94–98).

• Please manually verify that the 1-GPU and 2-GPU splits each cover a roughly equal total test duration.
• Check both blocks in l0_gb200.yml (the gte: 1/lte: 1 and the subsequent gte: 2/lte: 2 sections) for even distribution of long-running tests.


24-41: Balanced NVFP4 split (25 tests/job) confirmed

  • tests/integration/test_lists/test-db/l0_gb200.yml contains 75 entries → evenly splits into 3 batches of 25
    NVFP4 coverage expansion looks solid and pre-merge load remains balanced. Continue to monitor wall-time and any long-tail tests, but no changes are required.

17-20: No update needed for test class references
The class TestLlama3_1_8B still defines both test_nvfp4 and test_nvfp4_streaming (lines 37–63 in tests/integration/defs/accuracy/test_llm_api_pytorch.py), so the YAML entries will resolve correctly.

Likely an incorrect or invalid review comment.

jenkins/L0_Test.groovy (3)

1839-1845: LGTM: Introduces DGX_H200 4-GPU Triton post-merge stage wired to l0_dgx_h200

The new stage key matches the test-db block you added and uses gpuCount=4, consistent with its gating.


2469-2472: LGTM: Updated phase-2 gating to include GB200 multi-GPU signs

Adding "GB200-4_GPUs" and "GB200-8_GPUs" ensures those stages run in phase 2 with DGX signs logic.


1890-1898: LGTM: GB200 stage keys verified

The regex search confirms that all expected GB200 stage entries are present and correctly named in jenkins/L0_Test.groovy (lines 1890–1897). Approving these changes.

@tensorrt-cicd (Collaborator):

PR_Github #15884 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11937 (Partly Tested) completed with status: 'FAILURE'

@chzblych (Collaborator, Author):

/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-1, GB200-TensorRT-1, GB200-Triton-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1"

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)

266-266: Remove invalid Docker capability flag: --cap-add syslog causes container launch failures

Docker does not accept “--cap-add syslog” (CAP_SYSLOG isn’t available). This has been reported to break Slurm-run container startup.

Apply:

-                def dockerArgs = "--gpus ${gpuCount} --cap-add=SYS_ADMIN --ipc=host --security-opt seccomp=unconfined -u root:root -v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+                def dockerArgs = "--gpus ${gpuCount} --cap-add=SYS_ADMIN --ipc=host --security-opt seccomp=unconfined -u root:root -v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"

Also remove the same token in the non-Slurm path if present.

🧹 Nitpick comments (2)
jenkins/Build.groovy (1)

1-1: Library bump to dev-yanchaol-slurm — suggest pinning to immutable ref

Good to keep these consistent. To reduce future breakage from branch force-pushes, consider pinning to a tag or commit SHA once validated.

jenkins/L0_Test.groovy (1)

2134-2136: Bound pip sanity check to 1 hour — consider making it configurable

The timeout is reasonable, but making it a param or env (e.g., PIP_SANITY_TIMEOUT_MIN=60) improves debuggability if flakes occur.

-                            timeout(time: 1, unit: 'HOURS') {
+                            timeout(time: env.PIP_SANITY_TIMEOUT_MIN ? env.PIP_SANITY_TIMEOUT_MIN.toInteger() : 60, unit: 'MINUTES') {
                                 checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
                             }
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between dfdd2a3 and 6167b28.

📒 Files selected for processing (4)
  • jenkins/Build.groovy (1 hunks)
  • jenkins/BuildDockerImage.groovy (1 hunks)
  • jenkins/L0_MergeRequest.groovy (1 hunks)
  • jenkins/L0_Test.groovy (6 hunks)
🔇 Additional comments (6)
jenkins/BuildDockerImage.groovy (1)

1-1: Verify Jenkins Shared Library Branch and API Stability

  • All Jenkinsfiles in jenkins/ are now consistently pinned to bloom-jenkins-shared-lib@dev-yanchaol-slurm alongside trtllm-jenkins-shared-lib@main (confirmed in
    jenkins/L0_MergeRequest.groovy,
    jenkins/BuildDockerImage.groovy,
    jenkins/L0_Test.groovy,
    jenkins/Build.groovy)
  • No direct references to com.nvidia.bloom.* classes were found in the Jenkinsfile directory, suggesting jobs rely on library steps rather than explicit class imports
  • Next steps:
    • Ensure your Jenkins Global Pipeline Library configuration indexes the dev-yanchaol-slurm branch for bloom-jenkins-shared-lib.
    • Run or dry-run affected pipelines to confirm the branch’s API surface matches downstream expectations, catching any missing or changed library methods.
jenkins/L0_MergeRequest.groovy (1)

1-1: Confirm pinned shared library branch presence and compatibility

The following Jenkinsfiles now reference bloom-jenkins-shared-lib@dev-yanchaol-slurm—ensure that this branch exists in your Jenkins shared library configuration and provides the required classes (e.g., KubernetesManager, CloudManager, SlurmConfig). If you introduced this pin for GB200 support, consider using a specific tag or commit SHA to avoid unintended drift.

• jenkins/L0_Test.groovy
• jenkins/BuildDockerImage.groovy
• jenkins/Build.groovy
• jenkins/L0_MergeRequest.groovy

jenkins/L0_Test.groovy (4)

2471-2474: Phase-2 gating extended with GB200 multi-GPU signs — aligns with new stages

Including "GB200-4_GPUs" and "GB200-8_GPUs" ensures multi-GPU phases are separated as intended. Looks good.


1839-1839: DGX_H200 4-GPU Triton post-merge stage verified

  • Confirmed that tests/integration/test_lists/test-db/l0_dgx_h200.yml is present in this PR.
  • Verified the backend: triton entry at line 181 (and the Triton tests section at line 183) in that YAML file.
  • The stage tuple ["dgx-h200-x4", "l0_dgx_h200", 1, 1, 4] adheres to the [platform, yaml, split_id, split_count, gpu_count] schema and aligns with the new l0_dgx_h200 content.

LGTM.


1-1: Verify Jenkins Global Library branch configuration

All four Jenkinsfiles have been updated to reference the dev-yanchaol-slurm branch of the bloom-jenkins-shared-lib:

  • jenkins/L0_Test.groovy
  • jenkins/L0_MergeRequest.groovy
  • jenkins/BuildDockerImage.groovy
  • jenkins/Build.groovy

Please confirm on the Jenkins controller that:

  1. The dev-yanchaol-slurm branch is configured under Global Pipeline Libraries.
  2. It contains the GB200-related Slurm and Kubernetes helper functions expected by these pipelines.

No code changes are required in this PR beyond what’s already mirrored; this step ensures the new branch will be resolvable at runtime.


1890-1898: Validate GB200 Slurm partitions and test-DB entries

  • Tests reference two new platforms in jenkins/L0_Test.groovy:
    • Lines 1890–1897: “gb200-unrestricted” and “gb200-x4” are used as keys into SlurmConfig.partitionConfig.
    • Lines 219 and 295: these lookups occur in runLLMTestlistOnSlurm and runLLMTestlistOnSlurm_MultiNodes.
  • The test-DB YAML files exist and include the correct backends:
    • tests/integration/test_lists/test-db/l0_gb200.yml covers pytorch, tensorrt, and triton.
    • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml covers pytorch on 4 GPUs.
  • The actual definitions of SlurmConfig.partitionConfig (and SlurmConfig.clusterConfig) live in the shared library, not this repo.

Please manually verify that in the shared library:

  1. SlurmConfig.partitionConfig includes entries for
    • “gb200-unrestricted”
    • “gb200-x4”
  2. SlurmConfig.clusterConfig maps those partitions to valid clusters.
  3. The cluster definitions support the requested GPU counts and backends.

@tensorrt-cicd (Collaborator):

PR_Github #15986 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #15986 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12015 (Partly Tested) completed with status: 'FAILURE'

@chzblych (Collaborator, Author):

/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-1, GB200-TensorRT-1, GB200-Triton-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #15994 [ run ] triggered by Bot

@chzblych chzblych force-pushed the migrate-b200-to-gb200 branch from 6167b28 to f39899e Compare August 21, 2025 05:21
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
jenkins/L0_Test.groovy (2)

1745-1745: Same invalid flag in non-Slurm dockerArgs. Remove “--cap-add syslog”.

Mirror the fix at Line 266 here as well to prevent failures when running on a docker node.

-    def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+    def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"

266-266: Fix invalid Docker capability flag: “--cap-add syslog” breaks container launch.

Docker doesn’t recognize “syslog”; capabilities must match kernel names (e.g., SYS_ADMIN). This flag causes container start failures. Remove it.

Apply:

-                def dockerArgs = "--gpus ${gpuCount} --cap-add=SYS_ADMIN --ipc=host --security-opt seccomp=unconfined -u root:root -v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+                def dockerArgs = "--gpus ${gpuCount} --cap-add=SYS_ADMIN --ipc=host --security-opt seccomp=unconfined -u root:root -v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"

Follow-up: consider centralizing dockerArgs in one helper to avoid future drift between call sites.

🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)

2134-2136: Timeout for pip-install sanity: add explicit timeout diagnostics.

The 1h timeout prevents indefinite stalls—nice. To aid triage, add a message on timeout so it’s obvious in logs why this stage aborted.

-                            timeout(time: 1, unit: 'HOURS') {
-                                checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
-                            }
+                            try {
+                                timeout(time: 1, unit: 'HOURS') {
+                                    checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
+                                }
+                            } catch (InterruptedException e) {
+                                echo "checkPipInstall timed out after 1 hour for ${cpu_arch}/${wheelPath}"
+                                throw e
+                            }

2471-2474: Phase-2 gating: include GB200 signs and improve readability.

Good catch adding GB200-4/8_GPUs to the phase-2 split. Minor readability nit: avoid nested implicit it in closures.

-                        def dgxSigns = ["DGX_H100", "DGX_H200", "GB200-4_GPUs", "GB200-8_GPUs", "DGX_B200"]
-                        singleGpuJobs = parallelJobs.findAll{!dgxSigns.any{sign -> it.key.contains(sign)}}
-                        dgxJobs = parallelJobs.findAll{dgxSigns.any{sign -> it.key.contains(sign)}}
+                        def dgxSigns = ["DGX_H100", "DGX_H200", "GB200-4_GPUs", "GB200-8_GPUs", "DGX_B200"]
+                        singleGpuJobs = parallelJobs.findAll { entry -> !dgxSigns.any { s -> entry.key.contains(s) } }
+                        dgxJobs      = parallelJobs.findAll { entry ->  dgxSigns.any { s -> entry.key.contains(s) } }
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 6167b28 and f39899e.

📒 Files selected for processing (4)
  • jenkins/Build.groovy (1 hunks)
  • jenkins/BuildDockerImage.groovy (1 hunks)
  • jenkins/L0_MergeRequest.groovy (1 hunks)
  • jenkins/L0_Test.groovy (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • jenkins/Build.groovy
  • jenkins/L0_MergeRequest.groovy
  • jenkins/BuildDockerImage.groovy
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)

1-1: Pin shared library ref (avoid drifting dev branch).

Switching bloom-jenkins-shared-lib to dev-yanchaol-slurm looks intentional for SLURM work, but a moving dev branch can break the pipeline unexpectedly. Prefer a tagged release or commit SHA to guarantee reproducibility. Confirm the branch is long-lived and compatible with other Jenkinsfiles updated in this PR.

Would you like me to open a follow-up to pin this to a specific commit across all Jenkinsfiles for consistency?


1839-1845: The script above will dump the DGX H200 YAML so we can confirm that:

  • The l0_dgx_h200.yml file defines the “Triton”, “PyTorch”, and “TensorRT” post-merge subsets.
  • The system_gpu_count and ranges entries match the 1-of-1 and 1..3-of-3 splits you wired in L0_Test.groovy.

Once we validate those, we can conclude whether everything is correctly wired or if further adjustments are needed.

@tensorrt-cicd (Collaborator):

PR_Github #15994 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12021 (Partly Tested) completed with status: 'ABORTED'

@chzblych chzblych force-pushed the migrate-b200-to-gb200 branch from f39899e to a8b9716 Compare August 21, 2025 05:54
@chzblych (Collaborator, Author):

/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-1, GB200-TensorRT-1, GB200-Triton-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #16000 [ run ] triggered by Bot

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
jenkins/L0_Test.groovy (2)

1745-1745: Remove second occurrence of “--cap-add syslog” in dockerArgs

There’s another dockerArgs with “--cap-add syslog” used by non-Slurm runs. Remove it to avoid Docker run failures there as well.

-    def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+    def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"

266-266: Remove invalid Docker capability flag: “--cap-add syslog” breaks container launch

This was flagged previously and still appears here. Docker rejects “--cap-add syslog”; remove it to fix SLURM runner container starts.

-                def dockerArgs = "--gpus ${gpuCount} --cap-add=SYS_ADMIN --ipc=host --security-opt seccomp=unconfined -u root:root -v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+                def dockerArgs = "--gpus ${gpuCount} --cap-add=SYS_ADMIN --ipc=host --security-opt seccomp=unconfined -u root:root -v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"
🧹 Nitpick comments (2)
tests/integration/test_lists/test-db/l0_gb200.yml (1)

24-40: Large NVFP4 DeepSeekV3Lite matrix: consider param compaction

This enumerates many orthogonal dims (moe_backend, fp8kv, attention_dp, cuda_graph, overlap_scheduler, torch_compile, mtp_nextn). If maintenance becomes painful, consider moving these to a generated include (e.g., Mako/templated expansion) to reduce churn and drift. Not blocking.

jenkins/L0_Test.groovy (1)

2135-2137: 1-hour timeout around pip-install sanity: good guardrail

This prevents hangs in the checkPipInstall step from consuming the entire job window. Keep an eye on flakes; you might reduce to 30m once stability is confirmed.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between f39899e and a8b9716.

📒 Files selected for processing (8)
  • jenkins/Build.groovy (1 hunks)
  • jenkins/BuildDockerImage.groovy (1 hunks)
  • jenkins/L0_MergeRequest.groovy (1 hunks)
  • jenkins/L0_Test.groovy (6 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml (0 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h200.yml (1 hunks)
  • tests/integration/test_lists/test-db/l0_gb200.yml (2 hunks)
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml (1 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
🚧 Files skipped from review as they are similar to previous changes (5)
  • jenkins/L0_MergeRequest.groovy
  • jenkins/BuildDockerImage.groovy
  • tests/integration/test_lists/test-db/l0_dgx_h200.yml
  • jenkins/Build.groovy
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
🧰 Additional context used: the same three learnings as above (moraxu 2025-07-28, galagam 2025-08-06, achartier 2025-08-11), applied to tests/integration/test_lists/test-db/l0_gb200.yml.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tests/integration/test_lists/test-db/l0_gb200.yml (5)

6-7: Correct: narrow single-GPU gating for GB200 L0 suite

Updating system_gpu_count to gte:1/lte:1 matches the “Single-GPU” intent for this YAML and avoids pulling 4-GPU tasks into this list.


17-21: Sanity on Llama3.1 NVFP4 entries

The new NVFP4 coverage for Llama3.1 8B (including two streaming intervals) is appropriate for GB200 bring-up and aligns with the rest of the NVFP4-centric matrix in this file.


59-62: Torch “unittest/_torch … -k …” notations: double-check runner parsing

These lines rely on the internal pytest wrapper recognizing raw “path -k expr” forms (not test nodeids). This is consistent with prior lists, but please confirm the test-db renderer and the defs runner still accept these exact forms after recent refactors.

Would you like a quick repo scan script that checks each listed path exists and flags missing files/classes?

Also applies to: 70-78


90-105: TRT pre-merge block looks good

  • backend: tensorrt
  • stage: pre_merge
  • CLI-flow and unit tests chosen are representative for engine build and NVFP4/GEMM-plugin coverage.

119-122: Good Triton smoke set for GB200

The three Triton functional tests are reasonable post-merge coverage on single GPU. With the backend/stage fix above, they should render properly.

jenkins/L0_Test.groovy (3)

1-1: Pin shared library to a commit/tag to avoid pipeline drift

Moving bloom-jenkins-shared-lib to dev-yanchaol-slurm is fine for experimentation, but dev branches can change underfoot and break reproducibility. Recommend pinning to a stable tag or commit SHA.

Example:

-@Library(['bloom-jenkins-shared-lib@dev-yanchaol-slurm', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@dev-yanchaol-slurm/<commit-sha>', 'trtllm-jenkins-shared-lib@main']) _

Would you like a quick check that all Jenkinsfiles in this PR consistently use the same lib and ref?


1841-1841: Add DGX_H200-4_GPUs Triton post-merge: good

Matches the broader matrix shift toward H200 4-GPU Triton post-merge. No concerns.


2472-2474: Phase-2 gating list updated to include GB200 multi-GPU: good

The dgxSigns now capture GB200 4- and 8-GPU groups so the phase-2 stage properly separates single- and multi-GPU runs.

@chzblych chzblych force-pushed the migrate-b200-to-gb200 branch from a8b9716 to edf9081 Compare August 21, 2025 06:12
@chzblych
Collaborator Author

/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-1, GB200-TensorRT-1, GB200-Triton-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1" --disable-fail-fast

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
jenkins/L0_Test.groovy (2)

1784-1784: Remove the other “--cap-add syslog” in dockerArgs (dockerNode path).

There’s still an occurrence used when running sanity on a static docker node. It will similarly break container startup.

Apply:

-def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"

285-293: Remove invalid Docker capability “--cap-add syslog” (breaks container start).

Docker does not accept “syslog” as a capability. This causes container launch failures in Slurm runs.

Apply:

 def dockerArgs = "--gpus ${gpuCount} " +
     "--cap-add=SYS_ADMIN " +
     "--ipc=host " +
     "--security-opt seccomp=unconfined " +
     "-u root:root " +
     "-v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro " +
     "-v /tmp/ccache:${CCACHE_DIR}:rw " +
     "-v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw " +
-    "--cap-add syslog"
+    ""

Note: There’s another occurrence in launchTestJobs’ dockerArgs used for dockerNode runs; please remove it as well (see comment below).

🧹 Nitpick comments (2)
tests/integration/test_lists/test-db/l0_gb200.yml (1)

17-20: Pre-merge PyTorch grid validated — tests exist and count ~87

  • All three Llama3.1 NVFP4 tests resolve to existing entries in l0_gb200.yml (lines 18–20) and companion lists (e.g. l0_b200.yml, qa/llm_function_sanity.txt)
  • DeepSeek V3 Lite NVFP4 variants, Qwen3 NVFP4 permutations, GPT-OSS W4 tests, and the E2E cache-router eviction test all map to concrete entries in the test-db
  • Approximate count of pre-merge PyTorch tests in this block: 87

Optional refactor recommendation (if CI runtime is a concern):

  • Move one of the two Llama3.1 streaming intervals (stream_interval_4 or stream_interval_64) to post-merge
  • Gate a subset of DeepSeek V3 “chunked_prefill” permutations behind a post-merge flag

Also applies to lines 24–72.

jenkins/L0_Test.groovy (1)

274-284: Debug dumps are helpful; minor nit on env vars.

The env dumps aid triage. If these lines ever move under withEnv, prefer $WORKSPACE/$WORKSPACE_TMP to reflect overridden values at runtime.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 0929b13 and 426d490.

📒 Files selected for processing (3)
  • jenkins/L0_Test.groovy (11 hunks)
  • tests/integration/test_lists/test-db/l0_gb200.yml (3 hunks)
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
🔇 Additional comments (13)
tests/integration/test_lists/test-db/l0_gb200.yml (4)

6-7: Single‑GPU gating set correctly (gte:1, lte:1).

Matches the PR goal of migrating single‑GPU GB200 tests and aligns with the Jenkins stage definitions for GB200 single‑GPU runs.


76-85: Post‑merge TensorRT block looks consistent with CLI flow and TRT unit tests.

  • Backend set to tensorrt with stage post_merge and single‑GPU gating on GB200 aarch64 — consistent with the Jenkins “GB200‑TensorRT‑Post‑Merge‑{1,2}” stages.
  • Mix of CLI flow and TRT unit tests covers both builder and runtime paths.

Also applies to: 87-99


100-111: Triton block now correctly uses backend=triton and post_merge.

This fixes the earlier backend/stage mismatch that would have yielded an empty list for “GB200‑Triton‑Post‑Merge‑1.” Good catch.

Also applies to: 113-116


118-121: Post‑merge PyTorch additions align with NVFP4 matrices.

Reasonable breadth across CUTLASS/TRTLLM, attention_dp, and fp8kv toggles for single‑GPU GB200. No issues spotted.

Also applies to: 131-144

jenkins/L0_Test.groovy (9)

1-1: Pinning shared library branch to a dev branch — verify availability and stability.

Using bloom-jenkins-shared-lib@dev-yanchaol-slurm is fine for experimentation, but please confirm this ref exists on the Jenkins controller and won’t be GC’d or force‑pushed.

If this is long‑lived, consider pinning to a tag or commit SHA to avoid accidental breakages from branch moves.


214-219: Workspace cleanup after result packaging looks safe.

This executes after cacheErrorAndUploadResult has archived results. No objections.


228-233: Per‑node suffix and custom workspace improve isolation.

Unique node names and /tmp workspaces reduce cross‑job interference on the Slurm front‑end. LGTM.


299-299: Multi‑stage Docker runner usage is correct here.

Passing true to delete the workspace before the run avoids leftover artifacts when reusing nodes.


827-827: Doc: Step label updated; no functional impact.

Comment tweak only. No action needed.


1272-1282: Early workspace cleanup before test setup is fine; ensure no required inputs reside in CWD.

Given sources and artifacts are fetched later into ${LLM_ROOT}${config}, the rm -rf here is safe.


1879-1884: New DGX H200 post‑merge stages — consistent split and GPU gating.

Adds Triton and TensorRT splits with 4‑GPU gating. Matches recent matrix expansion; no issues spotted.


2173-2174: Add 1‑hour timeout around checkPipInstall — good guardrail.

Avoids hangs in pip sanity. Looks good.


1932-1941: Verification Required: Slurm Partitions and Test-DB Entry

  • I couldn’t locate a SlurmConfig class or a partitionConfig assignment in this repository—these are likely defined in your shared pipeline library.
    → Please confirm that the partitions “gb200-unrestricted” and “gb200-x4” are indeed defined in SlurmConfig.partitionConfig.

  • Confirmed presence of the test-DB file:
    tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml exists on this branch.

@tensorrt-cicd
Collaborator

PR_Github #16278 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12237 (Partly Tested) completed with status: 'FAILURE'

@chzblych chzblych force-pushed the migrate-b200-to-gb200 branch from 426d490 to 9de4407 Compare August 23, 2025 16:05
@chzblych
Collaborator Author

/bot run --stage-list "GB200-PyTorch-1, GB200-PyTorch-2, GB200-PyTorch-3, GB200-TensorRT-Post-Merge-1, GB200-TensorRT-Post-Merge-2, GB200-Triton-Post-Merge-1, GB200-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16281 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/integration/test_lists/test-db/l0_gb200.yml (1)

100-116: Triton backend/stage now correct and aligned with “GB200‑Triton‑Post‑Merge‑1”.

Previous backend/stage mismatch is resolved (backend=triton, stage=post_merge). This will prevent the empty selection issue noted earlier.

🧹 Nitpick comments (3)
tests/integration/test_lists/test-db/l0_gb200.yml (3)

87-99: TRT test selection provides engine, quant, and GEMM coverage; consider one GB200‑specific plugin smoke if available.

Current set is solid. If there’s a lightweight GB200‑specific kernel/plugin sanity (e.g., FP4 GEMM on GB200 path, or KV‑cache plugin), adding one micro‑test can shorten failure signal on regressions.


76-144: DRY optional: factor repeated condition clauses via YAML anchors (future maintenance).

The three single‑GPU GB200 Grace/Ubuntu condition blocks repeat the same wildcards/ranges. Using YAML anchors/aliases would reduce duplication and the chance of drift, but it’s optional if the team prefers explicit blocks.

Example:

-l0_gb200:
-- condition:
-    ranges:
-      system_gpu_count:
-        gte: 1
-        lte: 1
-    wildcards:
-      gpu:
-      - '*gb200*'
-      linux_distribution_name: ubuntu*
-      cpu: aarch64
+common_single_gb200: &common_single_gb200
+  ranges:
+    system_gpu_count: { gte: 1, lte: 1 }
+  wildcards:
+    gpu: ['*gb200*']
+    linux_distribution_name: ubuntu*
+    cpu: aarch64
+
+l0_gb200:
+- condition:
+    <<: *common_single_gb200
   terms:
     stage: pre_merge
     backend: pytorch
  tests:
   …
- condition:
-    ranges:
-      system_gpu_count:
-        gte: 1
-        lte: 1
-    wildcards:
-      gpu:
-      - '*gb200*'
-      linux_distribution_name: ubuntu*
-      cpu: aarch64
+ condition:
+    <<: *common_single_gb200
   terms:
     stage: post_merge
     backend: tensorrt
  tests:
   …

17-144: Balance GB200 shards in l0_gb200.yml to reduce CI flakiness

The shard counts are currently very uneven, which can lead to timeouts and instability:

  • Block 1: 57 test selectors
  • Block 2: 13 test selectors
  • Block 3: 5 test selectors
  • Block 4: 14 test selectors

Please redistribute tests more evenly—e.g., split the largest block or move some heavy tests into under-utilized shards—to help stabilize CI runtime.
File: tests/integration/test_lists/test-db/l0_gb200.yml

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 426d490 and 9de4407.

📒 Files selected for processing (3)
  • jenkins/L0_Test.groovy (11 hunks)
  • tests/integration/test_lists/test-db/l0_gb200.yml (3 hunks)
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
  • jenkins/L0_Test.groovy
🔇 Additional comments (5)
tests/integration/test_lists/test-db/l0_gb200.yml (5)

51-57: Nice inclusion of e2e quickstarts for single‑GPU GB200.

The curated PTP e2e tests cover NVFP4, FP8, MTP, EAGLE3, and n‑gram. This gives good breadth without invoking multi‑GPU paths.


76-86: Post‑merge TRT condition is well‑scoped for GB200 single‑GPU on Grace/Ubuntu.

The stage/backend/wildcards combination matches the intended GB200 TRT single‑GPU runner profile.


6-7: Multi-GPU flag scan completed; no issues detected.

  • The gating file tests/integration/test_lists/test-db/l0_gb200.yml lists only single-GPU parameters (gte: 1, lte: 1) and contains no “4gpu”, “multi-GPU”, or TP/PP/EP >1 hints.
  • A full-text sweep of the tests directory for tensor_parallel, pipeline_parallel, expert_parallel, tp, pp, or ep assignments >1 yielded no matches.

Based on these results, the single-GPU migration gating is correctly implemented and there are no unintended multi-GPU dependencies.


17-49: No additional baseline needed for TestLlama3_1_8B
The TestLlama3_1_8B class already defines both a non-quantized baseline (test_bfloat16) and an FP8 baseline (test_fp8), so the suggestion to add one of these tests is redundant. No further changes are required.

Likely an incorrect or invalid review comment.


118-144: No additional test needed—torch_compile=True for TRTLLM is already covered.

The test_nvfp4 method is decorated with:

  • @parametrize_with_ids("fp8kv,attention_dp,cuda_graph,overlap_scheduler", [(False, False, True, False), …])
  • @parametrize_with_ids("torch_compile", [False, True])
  • @parametrize_with_ids("mtp_nextn", [0, 2])
  • @parametrize_with_ids("moe_backend", ["CUTLASS", "TRTLLM"])

This generates the Cartesian product including the combination
(fp8kv=False, attention_dp=False, cuda_graph=True, overlap_scheduler=False, torch_compile=True, mtp_nextn=0, moe_backend=TRTLLM). No change is necessary.

Likely an incorrect or invalid review comment.

@tensorrt-cicd
Collaborator

PR_Github #16281 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12239 (Partly Tested) completed with status: 'FAILURE'

@chzblych chzblych force-pushed the migrate-b200-to-gb200 branch from 9de4407 to 8c29407 Compare August 24, 2025 09:14
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
jenkins/scripts/slurm_run.sh (2)

69-76: coverageConfigFile is undefined here — use the actual file path you created

You create $jobWorkspace/.coveragerc above, but later refer to $coverageConfigFile which is never set, causing sed and pytest-cov to misbehave.

-sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$coverageConfigFile"
+sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$jobWorkspace/.coveragerc"
@@
-    "--cov-config=$coverageConfigFile"
+    "--cov-config=$jobWorkspace/.coveragerc"

66-68: Don’t mutate paths with sed; trim whitespace instead

In sed BRE, `+` is a literal character, so s/[[:space:]]+/_/g never matches the intended runs of whitespace; and even where it did match, replacing whitespace with underscores would corrupt paths rather than clean them. Just strip the whitespace.

-trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
+trtllmWhlPath=$(echo "$trtllmWhlPath" | tr -d '[:space:]')
@@
-containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g')
+containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | tr -d '[:space:]')
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
+containerLDLibPath=$(echo "$containerLDLibPath" | tr -d '[:space:]')

Also applies to: 77-80
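
A quick standalone check (the path here is a hypothetical stand-in) shows why the BRE `+` never matches and why `tr -d` is the safer trim:

```shell
#!/usr/bin/env bash
# A value with stray leading/trailing whitespace, as might leak in
# from a command substitution in slurm_run.sh.
p=$'  /tmp/wheel dir/trtllm.whl\n'

# Broken: in sed BRE, '+' is a literal plus, so this pattern matches
# nothing here and the whitespace survives untouched.
broken=$(printf '%s' "$p" | sed 's/[[:space:]]+/_/g')

# Suggested fix: strip all whitespace outright.
fixed=$(printf '%s' "$p" | tr -d '[:space:]')

echo "broken=[$broken]"
echo "fixed=[$fixed]"
```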

♻️ Duplicate comments (2)
tests/integration/test_lists/test-db/l0_gb200.yml (1)

52-52: NV-bugged pre-merge test will block CI — mark xfail or move to post-merge

disaggregated/test_workers.py::test_workers_kv_cache_aware_router_eviction is annotated with “# nvbugs 5300551” but not xfail/skip. It will fail pre-merge.

Options:

  • Add xfail in the test file with reason “nvbugs 5300551”.
  • Or remove this selector from pre_merge here and include it only in a post_merge block.
jenkins/L0_Test.groovy (1)

284-292: Remove invalid Docker flag: “--cap-add syslog” breaks container start

Docker doesn’t accept “--cap-add syslog” (kernel caps are uppercase; CAP_SYSLOG isn’t generally available). This will fail docker.image(...).inside().

                 def dockerArgs = "--gpus ${gpuCount} " +
                     "--cap-add=SYS_ADMIN " +
                     "--ipc=host " +
                     "--security-opt seccomp=unconfined " +
                     "-u root:root " +
                     "-v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro " +
                     "-v /tmp/ccache:${CCACHE_DIR}:rw " +
-                    "-v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw " +
-                    "--cap-add syslog"
+                    "-v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw "

Note: There’s another occurrence later in launchTestJobs (around Line 1789). Please remove that too.

🧹 Nitpick comments (3)
jenkins/scripts/slurm_run.sh (1)

89-92: Avoid eval; execute the array directly

eval risks word-splitting and injection. You already built an argv array; invoke it as-is.

-fullCmd="${testCmdLines[*]}"
-echo "Running: $testCase"
-echo "Full Command: $fullCmd"
-eval $fullCmd
+echo "Running: $testCase"
+printf 'Full Command (argv): %q ' "${testCmdLines[@]}"; echo
+"${testCmdLines[@]}"
tests/integration/test_lists/test-db/l0_gb200.yml (1)

1-2: Header comment is misleading

The file header says “not used in the pipeline now,” but this YAML defines concrete GB200 single-GPU suites. Either wire it up in Jenkins or update the comment to reflect current usage to avoid confusion.

Would you like me to wire this into a Jenkins stage (e.g., GB200-Single-GPU-PyTorch-1) or remove the header note?

jenkins/L0_Test.groovy (1)

214-219: Duplicate workspace cleanup — consolidate to one place

You added a “Clean up the workspace” here and also inside cacheErrorAndUploadResult (Lines 591–596). Two rm -rf passes are redundant and can complicate debugging.

Consider removing this copy and keeping cleanup centralized inside cacheErrorAndUploadResult.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 9de4407 and 8c29407.

📒 Files selected for processing (6)
  • jenkins/L0_Test.groovy (11 hunks)
  • jenkins/scripts/slurm_run.sh (2 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml (0 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h200.yml (1 hunks)
  • tests/integration/test_lists/test-db/l0_gb200.yml (3 hunks)
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml (1 hunks)
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/test_lists/test-db/l0_dgx_h200.yml
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
🧰 Additional context used
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh

[warning] 37-37: pytestTestTimeout is referenced but not assigned.

(SC2154)

🔇 Additional comments (7)
jenkins/scripts/slurm_run.sh (1)

36-38: Provide a default for pytestTestTimeout (SC2154) and export safely

pytestTestTimeout may be unset depending on the caller. Set a sane default before exporting and quote the export.

+pytestTestTimeout="${pytestTestTimeout:-7200}"
-export CPP_TEST_TIMEOUT_OVERRIDDEN=$pytestTestTimeout
+export CPP_TEST_TIMEOUT_OVERRIDDEN="${pytestTestTimeout}"

If this script is ever run outside the multi-node launcher, this guards against spurious failures.
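
The `${var:-default}` expansion the diff relies on behaves as follows (the timeout values here are illustrative, not prescriptive):

```shell
#!/usr/bin/env bash
unset pytestTestTimeout

# Unset -> the default is used.
echo "timeout=${pytestTestTimeout:-7200}"

# Set by the caller -> the caller's value wins.
pytestTestTimeout=3600
echo "timeout=${pytestTestTimeout:-7200}"

# ':-' (unlike '-') also treats an *empty* value as unset, which guards
# against the launcher exporting an empty string by mistake.
pytestTestTimeout=""
echo "timeout=${pytestTestTimeout:-7200}"
```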

tests/integration/test_lists/test-db/l0_gb200.yml (2)

8-9: LGTM: single-GPU gating updated

system_gpu_count moved to gte:1/lte:1 for GB200 single-GPU. Matches the intent of single-GPU stages.


102-118: LGTM: Triton post-merge block now correctly uses backend: triton

This fixes the earlier backend/stage mismatch; “triton_server/test_triton.py” will now render for GB200-Triton-Post-Merge stages.

jenkins/L0_Test.groovy (4)

228-233: Per-node name/workspace looks good

Unique node name and custom /tmp workspace reduce collisions on shared clusters.


356-356: Timeout plumbed for MultiNodes

pytestTestTimeout is explicitly set to 7200 seconds and exported into slurm_launch.sh. Aligns with slurm_run.sh changes.


2172-2173: LGTM: bound pip-install sanity to 1 hour

Bounding checkPipInstall with a timeout avoids indefinite hung builds.


2508-2511: LGTM: clearer multi-GPU phase separation

Using ["2_GPUs", "4_GPUs", "8_GPUs"] to split phase 2 makes stage grouping explicit and matches the naming.

Signed-off-by: Yanchao Lu <[email protected]>
@chzblych chzblych force-pushed the migrate-b200-to-gb200 branch from 8c29407 to 2c2fab7 Compare August 24, 2025 09:25
@chzblych chzblych changed the title [None][infra] Migrate B200 single GPU tests to GB200 [None][infra] Prepare for single GPU GB200 test pipeline Aug 24, 2025
@chzblych
Collaborator Author

/bot run --stage-list "GB200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1, DGX_H200-4_GPUs-Triton-Post-Merge-1"

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
jenkins/scripts/slurm_run.sh (1)

5-14: Undefined coverageConfigFile: sed substitution and --cov-config will fail

The .coveragerc is created at $jobWorkspace/.coveragerc, but later lines reference $coverageConfigFile, which is never set. sed -i … "$coverageConfigFile" (Line 69) and --cov-config=$coverageConfigFile (Line 75) will fail.

Apply this diff to define and consistently use coverageConfigFile:

-# generate .coveragerc in workspace
-cat << EOF > $jobWorkspace/.coveragerc
+# generate .coveragerc in workspace
+coverageConfigFile="$jobWorkspace/.coveragerc"
+cat << EOF > "$coverageConfigFile"
 [run]
 branch = True
 data_file = $jobWorkspace/.coverage.$stageName
 [paths]
 source =
     $llmSrcNode/tensorrt_llm/
     ---wheel_path---/tensorrt_llm/
 EOF
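
The fix boils down to: name the file once, write it with the heredoc, then substitute the placeholder in that same path. A standalone sketch (all paths hypothetical):

```shell
#!/usr/bin/env bash
set -e
jobWorkspace=$(mktemp -d)
stageName="demo"
llmSrcNode="/src/llm"

# Name the config file once and reuse the variable everywhere below.
coverageConfigFile="$jobWorkspace/.coveragerc"
cat << EOF > "$coverageConfigFile"
[run]
branch = True
data_file = $jobWorkspace/.coverage.$stageName
[paths]
source =
    $llmSrcNode/tensorrt_llm/
    ---wheel_path---/tensorrt_llm/
EOF

# Later, substitute the resolved wheel path into the same file.
trtllmWhlPath="/opt/wheels/trtllm"
sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$coverageConfigFile"

cat "$coverageConfigFile"
```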
jenkins/L0_Test.groovy (1)

1775-1775: Same issue elsewhere: drop “--cap-add syslog” from dockerArgs

This recurrence will fail non-Slurm docker runs as well.

Apply:

-    def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw --cap-add syslog"
+    def dockerArgs = "-v /mnt/scratch.trt_llm_data:/scratch.trt_llm_data:ro -v /tmp/ccache:${CCACHE_DIR}:rw -v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw"
♻️ Duplicate comments (4)
jenkins/scripts/slurm_run.sh (1)

46-46: Ack: trailing comma fixed for pytest timeout-method

The prior “--timeout-method=thread,” issue is resolved; the option is now valid.

tests/integration/test_lists/test-db/l0_gb200.yml (2)

52-52: NV-bugged test in pre-merge will gatekeep CI; move or xfail

disaggregated/test_workers.py::test_workers_kv_cache_aware_router_eviction is tagged with “nvbugs 5300551” and has no xfail/skip in source. Keeping it in pre_merge will likely fail GB200 single‑GPU gating.

Suggested adjustment in this YAML (remove from pre_merge):

-  - disaggregated/test_workers.py::test_workers_kv_cache_aware_router_eviction[TinyLlama-1.1B-Chat-v1.0] # nvbugs 5300551

Alternatively, mark the test with @pytest.mark.xfail(reason="nvbugs 5300551") in tests/integration/defs/disaggregated/test_workers.py so it can remain in pre_merge without blocking.


112-118: Good fix: Triton block now matches backend=triton and stage=post_merge

This resolves the earlier backend/stage mismatch that produced empty stage renders.

jenkins/L0_Test.groovy (1)

277-286: Remove invalid Docker capability: “--cap-add syslog” breaks container launch

Docker does not support a “syslog” capability, and “--cap-add syslog” causes run failures. It also appears as a stray token in the args string.

Apply:

                 def dockerArgs = "--gpus ${gpuCount} " +
                     "--cap-add=SYS_ADMIN " +
                     "--ipc=host " +
                     "--security-opt seccomp=unconfined " +
                     "-u root:root " +
                     "-v /home/scratch.trt_llm_data:/scratch.trt_llm_data:ro " +
                     "-v /tmp/ccache:${CCACHE_DIR}:rw " +
-                    "-v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw " +
-                    "--cap-add syslog"
+                    "-v /tmp/pipcache/http-v2:/root/.cache/pip/http-v2:rw "
🧹 Nitpick comments (5)
jenkins/scripts/slurm_run.sh (4)

37-37: Guard CPP_TEST_TIMEOUT_OVERRIDDEN with a sane default

pytestTestTimeout is passed from the launcher, but shellcheck warns it may be unset (SC2154). Provide a fallback to avoid exporting an empty value.

Apply:

-export CPP_TEST_TIMEOUT_OVERRIDDEN=$pytestTestTimeout
+export CPP_TEST_TIMEOUT_OVERRIDDEN="${pytestTestTimeout:-3600}"
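The `${var:-default}` expansion can be checked in isolation; 3600 here is the assumed fallback from the diff above, not a value mandated by the pipeline:

```shell
#!/bin/sh
# When the launcher never sets pytestTestTimeout, the fallback is used
# and the exported value is never empty.
unset pytestTestTimeout
export CPP_TEST_TIMEOUT_OVERRIDDEN="${pytestTestTimeout:-3600}"
echo "$CPP_TEST_TIMEOUT_OVERRIDDEN"   # prints 3600

# When the launcher does set it, the provided value wins.
pytestTestTimeout=7200
export CPP_TEST_TIMEOUT_OVERRIDDEN="${pytestTestTimeout:-3600}"
echo "$CPP_TEST_TIMEOUT_OVERRIDDEN"   # prints 7200
```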

67-67: sed pattern uses literal “+”; whitespace trimming doesn’t work

sed without -E treats “+” literally, so 's/[[:space:]]+/_/g' won’t collapse whitespace. Use tr to strip whitespace/newlines.

Apply:

-trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
+trtllmWhlPath=$(echo "$trtllmWhlPath" | tr -d '[:space:]')
-containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g')
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
+containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | tr -d '[:space:]')
+containerLDLibPath=$(echo "$containerLDLibPath" | tr -d '[:space:]')

Also applies to: 78-80
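A quick standalone check of the BRE pitfall and the `tr` replacement (throwaway strings, not values from the pipeline):

```shell
#!/bin/sh
# In POSIX BRE (sed without -E), "+" is a literal character, so this
# pattern only matches whitespace followed by a literal plus sign.
printf 'a  b\n' | sed 's/[[:space:]]+/_/g'      # prints "a  b" unchanged

# tr -d strips every whitespace character, including tabs and newlines.
printf ' a \t b \n' | tr -d '[:space:]'          # prints "ab"

# If substitution (rather than deletion) is actually wanted, use ERE:
printf 'a  b\n' | sed -E 's/[[:space:]]+/_/g'    # prints "a_b"
```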


23-23: Run apt-get update before install to reduce transient failures

Install may fail on fresh containers without an updated package index.

Apply:

-    apt-get install -y libffi-dev
+    apt-get update -y && apt-get install -y libffi-dev

29-29: Typo in debug log: double equals in GPU_UUIDS

Minor nit: the output currently prints “GPU_UUIDS = =…”.

Apply:

-    echo "HOST_NODE_NAME = $HOST_NODE_NAME ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
+    echo "HOST_NODE_NAME = $HOST_NODE_NAME ; GPU_UUIDS = $gpuUuids ; STAGE_NAME = $stageName"
tests/integration/test_lists/test-db/l0_gb200.yml (1)

1-2: Header comment contradicts usage; please clarify

The header says “Don’t add any tests here” and “not used in the pipeline now,” but this file now defines active pre_merge and post_merge blocks. Please update or remove the misleading comments to prevent accidental edits elsewhere.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c29407 and 2c2fab7.

📒 Files selected for processing (4)
  • jenkins/L0_Test.groovy (9 hunks)
  • jenkins/scripts/slurm_run.sh (2 hunks)
  • tests/integration/test_lists/test-db/l0_gb200.yml (3 hunks)
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/test_lists/test-db/l0_gb200_multi_gpus.yml
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_gb200.yml
  • jenkins/scripts/slurm_run.sh
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/test_lists/test-db/l0_gb200.yml
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • tests/integration/test_lists/test-db/l0_gb200.yml
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh

[warning] 37-37: pytestTestTimeout is referenced but not assigned.

(SC2154)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)

2157-2159: Nice: bound checkPipInstall with a 1h timeout

Good defensive guard to prevent long-running artifact queries from stalling the whole pipeline.


1923-1927: Verify GB200 single‑GPU migration wiring (stage references vs YAMLs)

SBSASlurmTestConfigs comments out “GB200-PyTorch-1” and only wires multi‑GPU (l0_gb200_multi_gpus). Given the PR objective is “Migrate B200 single GPU tests to GB200,” please confirm the intended single‑GPU GB200 pre/post stages are added somewhere (or update the comment). Otherwise, the new l0_gb200 single‑GPU content won’t be exercised.

@tensorrt-cicd
Collaborator

PR_Github #16302 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16302 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12256 (Partly Tested) completed with status: 'SUCCESS'

@chzblych
Collaborator Author

/bot skip --comment "Partial testing is sufficient"

@tensorrt-cicd
Collaborator

PR_Github #16309 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16309 [ skip ] completed with state SUCCESS
Skipping testing for commit 2c2fab7
