
Conversation

@chzblych chzblych commented Aug 25, 2025

Summary by CodeRabbit

  • Chores

    • Split CI pod timeouts into TEST/BUILD/SLURM (SLURM extended to ~22h); unified build timeout.
    • Per-run multi-node isolation with unique run IDs, per-run scripts/logs, remote capture, and coordinated teardown.
    • Added retry for package installs, richer environment dumps, explicit workspace purges, and clearer "Finished test stage execution" logs.
  • Bug Fixes

    • Improved Slurm handling: capture/validate job IDs, extended online-wait (~15h), and Slurm-aware cleanup for single- and multi-node runs.

@chzblych chzblych requested review from a team as code owners August 25, 2025 05:12
@chzblych chzblych requested review from tburt-nv and kxdc August 25, 2025 05:12
Contributor

coderabbitai bot commented Aug 25, 2025

📝 Walkthrough

Walkthrough

Refactors pod timeout constants into test/build/slurm scopes; introduces per-run Slurm jobUID, per-run scripts/logs, and server-side extraction/validation/propagation of Slurm job IDs; extends online-wait window; updates cleanup signatures to accept Slurm IDs and run Slurm-specific cleanup; improves logging, workspace hygiene, and pip-install retry.
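The extraction/validation step described above can be exercised as a standalone shell filter — a minimal sketch mirroring the sed patterns used elsewhere in this PR; the function name and log filename are illustrative, not the pipeline's actual code:

```shell
# Pull the last Slurm job ID out of submission output, covering both
# sbatch ("Submitted batch job <id>") and srun
# ("srun: job <id> queued" / "has been allocated") phrasings.
extract_slurm_job_id() {
  sed -n \
    -e 's/.*Submitted batch job \([0-9][0-9]*\).*/\1/p' \
    -e 's/.*srun: job \([0-9][0-9]*\) queued.*/\1/p' \
    -e 's/.*srun: job \([0-9][0-9]*\) has been allocated.*/\1/p' \
    "$1" | tail -n1
}
```

Taking the last match means a resubmission later in the log wins over an earlier attempt, which matches the `tail -n1` behavior in the pipeline's own sed command.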

Changes

Cohort / File(s) Summary of Changes
L0 pipeline (timeouts, Slurm job handling, cleanup, logging)
jenkins/L0_Test.groovy
Replaced POD_TIMEOUT_SECONDS/POD_TIMEOUT_SECONDS_TMP with POD_TIMEOUT_SECONDS_TEST, POD_TIMEOUT_SECONDS_BUILD, POD_TIMEOUT_SECONDS_SLURM; assign timeouts per-container in pod config; add per-run jobUID, per-run slurm_run.sh/slurm_launch.sh and slurm_output.log; upload/cat remote scripts for visibility; parse/validate slurmJobID from submission output (single- and multi-node); extend online-wait loop (12→90 iterations); update cleanup signatures to accept slurmJobID and run scancel/sacct/scontrol; propagate per-run Slurm IDs to final cleanup; add env/pwd/ls dumps, workspace purge steps, delays, and explicit logs; wrap pip-install checks with llmRetry.
Build pod timeout constant
jenkins/Build.groovy
Remove POD_TIMEOUT_SECONDS and POD_TIMEOUT_SECONDS_TMP; add POD_TIMEOUT_SECONDS_BUILD (default 43200); use POD_TIMEOUT_SECONDS_BUILD for build pod sleep/timeout.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant J as Jenkins pipeline
  participant N as Remote node
  participant S as Slurm scheduler
  participant K as Kubernetes

  rect rgb(250,250,255)
    note right of J: prepare per-run jobUID\ncreate per-run workspace & scripts\nlog env/pwd/ls
  end

  J->>N: upload per-run scripts (slurm_run.sh, slurm_launch.sh)
  J->>N: cat uploaded scripts (debug visibility)
  J->>N: execute submission script -> tee per-run slurm_output.log
  N->>S: sbatch / srun submission
  S-->>N: submission output (contains job ID)
  N-->>J: slurm_output.log (captured)
  J->>J: parse slurm_output.log -> extract & validate slurmJobID
  alt Node online within window
    J->>K: run tests (containers use POD_TIMEOUT_SECONDS_TEST / BUILD / SLURM)
  else Timeout / failure
    J->>J: fail pipeline with explicit error and trigger cleanup
  end
  rect rgb(245,255,245)
    note right of J: Finalization — invoke cleanup with slurmJobID\nrun scancel/sacct/scontrol and remove node resources\npurge workspace & logs
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • kxdc
  • tburt-nv

@chzblych chzblych marked this pull request as draft August 25, 2025 05:12
@chzblych
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

116-118: Confirm Slurm results path compatibility with the new shared-lib

This scp targets /home/svc_tensorrt/bloom/scripts/${nodeName}/results/results.xml. If the dev branch changed where the agent writes results, this silently degrades to “No results xml to submit” and we lose JUnit data. Recommend confirming the path contract; optionally allow an env override for fast iteration.

Minimal, backward-compatible tweak to enable an override:

-            def resultsFilePath = "/home/svc_tensorrt/bloom/scripts/${nodeName}/results/results.xml"
-            def downloadResultCmd = "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host}:${resultsFilePath} ${stageName}/"
+            def defaultResultsFile = "/home/svc_tensorrt/bloom/scripts/${nodeName}/results/results.xml"
+            def resultsFilePath = env.SLURM_RESULTS_FILE ?: defaultResultsFile
+            def downloadResultCmd = "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host}:${resultsFilePath} ${stageName}/"

Quick verification during a trial run: check both old/new locations before scp to reduce false negatives.
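A hedged sketch of that pre-scp probe — the helper name is an assumption for illustration, and the candidate paths would come from the node layout rather than being hard-coded:

```shell
# Return the first existing results file from a list of candidate
# locations, so a relocated path produces an explicit diagnostic
# rather than a silent "No results xml to submit".
find_results_file() {
  for p in "$@"; do
    if [ -f "$p" ]; then
      printf '%s\n' "$p"
      return 0
    fi
  done
  echo "No results xml found in candidate locations: $*" >&2
  return 1
}
```

Calling it with the old location first keeps current behavior while surfacing a relocation in the log.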

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f61b74f and eaa6be3.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)

242-247: Verify library resource availability: slurm_jenkins_agent_setup.sh

This relies on Utils.copyLibraryResource to fetch slurm_jenkins_agent_setup.sh from the shared lib. Ensure the dev branch keeps the same resource path/name; otherwise this step fails early in “Request Node via SLURM.”

Suggested dry-run check during the first job on the new branch:

  • Log the resolved library resource path before chmod.
  • If failure, print “library resource not found in bloom-jenkins-shared-lib@dev-yanchaol-slurm-output” to speed diagnosis.

@tensorrt-cicd
Collaborator

PR_Github #16379 [ run ] triggered by Bot

@chzblych
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16383 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16379 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #16383 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12311 (Partly Tested) completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@chzblych
Collaborator Author

/bot run --stage-list "GH200-PackageSanityCheck-PY312-DLFW, DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

@chzblych chzblych changed the title [None][test] - Test Slurm output file and path [None][chore] Some improvements for CI stability Aug 25, 2025
@chzblych chzblych force-pushed the yanchaol-slurm-output branch from d0980b5 to 9b4e355 Compare August 25, 2025 15:04
@chzblych
Collaborator Author

/bot run --stage-list "GH200-PackageSanityCheck-PY312-DLFW, DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16456 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16457 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16456 [ run ] completed with state ABORTED

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)

1-1: Resolved: shared-lib pinned to main (reproducible builds).

You reverted to using bloom-jenkins-shared-lib@main. This addresses the prior concern about pinning to a feature branch and restores reproducibility.

🧹 Nitpick comments (6)
jenkins/L0_Test.groovy (6)

248-249: Gate full script dumps behind a debug flag to avoid leaking internals to the console.

Dumping the entire setup script into the build log increases noise and risks exposing internal paths/infra details. Use the existing DETAILED_LOG flag to control this.

Apply this diff:

-                Utils.exec(pipeline, script: "cat ${jenkinsSetupPath}")
+                if (testFilter[(DETAILED_LOG)]) {
+                    Utils.exec(pipeline, script: "cat ${jenkinsSetupPath}")
+                }

300-304: Replace fixed sleeps with deterministic destroy+wait logic.

Fixed 30s sleeps add latency without guarantees. Prefer a destroy + bounded wait for node offline to make cleanup predictable and faster when possible.

Apply this diff:

-        Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
-        CloudManager.destroyNode(nodeName)
-        Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
-        cleanUpNodeResources(pipeline, cluster, nodeName)
+        timeout(time: 10, unit: 'MINUTES') {
+            // Try destroy, then verify the node goes offline; retry a few times if needed.
+            retry(3) {
+                CloudManager.destroyNode(nodeName)
+                sleep(time: 10, unit: 'SECONDS')
+                if (CloudManager.isNodeOnline(nodeName)) {
+                    error "Node ${nodeName} still online after destroy attempt; retrying..."
+                }
+            }
+        }
+        cleanUpNodeResources(pipeline, cluster, nodeName)

326-326: Minor: gate noisy env/pwd/ls dump behind DETAILED_LOG.

Great for debugging; consider guarding with DETAILED_LOG to keep default logs tight.

Example change:

-    Utils.exec(pipeline, script: "env | sort && pwd && ls -alh")
+    if (testFilter[(DETAILED_LOG)]) {
+        Utils.exec(pipeline, script: "env | sort && pwd && ls -alh")
+    }

370-370: Gate slurm_run.sh content dump behind DETAILED_LOG.

Same rationale as above; reduces log noise and potential info exposure.

Apply this diff:

-                Utils.exec(pipeline, script: "cat ${scriptRunLocalPath}")
+                if (testFilter[(DETAILED_LOG)]) {
+                    Utils.exec(pipeline, script: "cat ${scriptRunLocalPath}")
+                }

428-428: Gate slurm_launch.sh content dump behind DETAILED_LOG.

The generated script includes environment and paths; dump it only when requested.

Apply this diff:

-                Utils.exec(pipeline, script: "cat ${scriptLaunchDestPath}")
+                if (testFilter[(DETAILED_LOG)]) {
+                    Utils.exec(pipeline, script: "cat ${scriptLaunchDestPath}")
+                }

322-325: Harden jobUID/suffix for cross-system safety
BUILD_TAG may include characters that aren’t valid in Slurm job names, SSH identifiers, or filesystem paths. To prevent downstream failures, sanitize the suffix to allow only lowercase letters, digits, hyphens (and underscores if desired), then cap the entire jobUID at 63 characters (the common max for many schedulers and filesystems).

Please update lines 322–325 in jenkins/L0_Test.groovy as follows:

-    // Create a unique suffix for the job name
-    String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
-    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
+    // Create a unique, sanitized suffix for the job name
+    def suffixRaw = ("${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}").toLowerCase()
+    String customSuffix = suffixRaw.replaceAll('[^a-z0-9-]', '')
+    // Build jobUID and cap length for Slurm/SSH/filesystem compatibility
+    def jobUID = ("${cluster.host}-multi_node_test-${customSuffix}")
+        .replaceAll('[^a-z0-9-_]', '')
+        .take(63)

Optionally, add a runtime check to flag any violations:

#!/bin/bash
# Ensure jobUID only contains [a-z0-9_-] and is ≤63 chars
[[ "$jobUID" =~ ^[a-z0-9_-]{1,63}$ ]] || { echo "Invalid jobUID: $jobUID" >&2; exit 1; }
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between d0980b5 and 9b4e355.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (6 hunks)

@chzblych chzblych force-pushed the yanchaol-slurm-output branch 2 times, most recently from e57f754 to f89dfd7 Compare August 25, 2025 15:18
@chzblych
Collaborator Author

/bot run --stage-list "GH200-PackageSanityCheck-PY312-DLFW, DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16458 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16457 [ run ] completed with state ABORTED
/LLM/main/L0_MergeRequest_PR pipeline #12364 (Partly Tested) completed with status: 'ABORTED'

@tensorrt-cicd
Collaborator

PR_Github #16458 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12365 (Partly Tested) completed with status: 'SUCCESS'

@chzblych
Collaborator Author

/bot run --stage-list "GH200-PackageSanityCheck-PY312-DLFW" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16503 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16867 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16867 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12665 (Partly Tested) completed with status: 'FAILURE'

@chzblych chzblych force-pushed the yanchaol-slurm-output branch from 0ad26bf to 7ec59b1 Compare August 28, 2025 16:21
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (13)
jenkins/L0_Test.groovy (13)

47-49: Make pod timeouts scope-specific; don’t let one env var override all three.

Use per-scope overrides with fallback to global, as previously suggested.

-POD_TIMEOUT_SECONDS_TEST = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
-POD_TIMEOUT_SECONDS_BUILD = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
-POD_TIMEOUT_SECONDS_SLURM = env.podTimeoutSeconds ? env.podTimeoutSeconds : "79200"  // Use 22 hours to allow for 2 hour of buffer.
+POD_TIMEOUT_SECONDS_TEST  = (env.podTimeoutSecondsTest  ?: env.podTimeoutSeconds ?: "21600")
+POD_TIMEOUT_SECONDS_BUILD = (env.podTimeoutSecondsBuild ?: env.podTimeoutSeconds ?: "43200")
+POD_TIMEOUT_SECONDS_SLURM = (env.podTimeoutSecondsSlurm ?: env.podTimeoutSeconds ?: "79200")  // 22h with 2h buffer

137-169: Guard SLURM cleanup when no job ID; avoid unconditional scancel/sacct.

Skip cleanup if submission failed. This was flagged earlier and still applies.

-        Utils.exec(
-            pipeline,
-            timeout: false,
-            script: Utils.sshUserCmd(
-                remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-            )
-        )
+        if (slurmJobID?.trim()) {
+            Utils.exec(
+                pipeline,
+                timeout: false,
+                script: Utils.sshUserCmd(
+                    remote,
+                    "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                )
+            )
+        } else {
+            Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no job ID")
+        }

171-205: Same issue in single-node cleanup: guard on missing slurmJobID.

-        Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID}")
+        Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID ?: 'N/A'}")
@@
-        Utils.exec(
-            pipeline,
-            timeout: false,
-            script: Utils.sshUserCmd(
-                remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-            )
-        )
+        if (slurmJobID?.trim()) {
+            Utils.exec(
+                pipeline,
+                timeout: false,
+                script: Utils.sshUserCmd(
+                    remote,
+                    "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                )
+            )
+        } else {
+            Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no job ID")
+        }

284-297: Broaden job-id parsing to include --parsable numeric output.

                 def jobIDs = slurmSubmitOutput
                     .readLines()
                     .collect { it.trim() }
                     .collectMany { line ->
                         def ids = []
+                        if (line ==~ /^\d+$/) ids << line
                         def m1 = (line =~ /Submitted batch job (\d+)/)
                         if (m1) ids << m1[0][1]  // Extract the first captured group
                         def m2 = (line =~ /srun: job (\d+) (queued|has been allocated)/)
                         if (m2) ids << m2[0][1]  // Extract the first captured group
                         return ids
                     }

347-353: Cleanup: replace shell sleeps; guard cleanup on job ID.

-            Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
+            echo "Sleeping to allow docker stop..."
+            sleep(time: 30, unit: 'SECONDS')
             CloudManager.destroyNode(nodeName)
-            Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
-            cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            echo "Sleeping to allow node destruction..."
+            sleep(time: 30, unit: 'SECONDS')
+            if (slurmJobID?.trim()) {
+                cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            } else {
+                echo "Skip SLURM cleanup: no job ID"
+            }

371-377: Sanitize BUILD_TAG-derived suffix for jobUID; enforce [a-z0-9-] and length.

Prevents illegal characters in job names/paths. Apply similar sanitizer to single-node (Lines 243–247).

-    // Create a unique suffix for the job name
-    String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
-    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
+    // Create a unique, sanitized suffix for the job name (lowercase, [a-z0-9-], max 63)
+    def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+    String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+    customSuffix = customSuffix.length() > 63 ? customSuffix.substring(0, 63) : customSuffix
+    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"

423-425: Avoid exposing passwords on process command line (scp upload of run script).

-                Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)
+                Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)

476-477: Harden pipe: enable pipefail and quote output path.

-                    ${srunCmd} 2>&1 | tee ${slurmOutputFile}
+                    set -o pipefail
+                    ${srunCmd} 2>&1 | tee "${slurmOutputFile}"

480-482: Avoid exposing passwords on process command line (scp upload of launch script).

-                Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)
+                Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)

501-513: Broaden multi-node job-id extraction; include numeric lines.

-            def slurmJobID = Utils.exec(
+            def slurmJobID = Utils.exec(
                 pipeline,
                 timeout: false,
                 script: Utils.sshUserCmd(
                     remote,
-                    "\"sed -n " +
-                    "-e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
-                    "-e 's/.*srun: job \\([0-9]\\+\\) queued.*/\\1/p' " +
-                    "-e 's/.*srun: job \\([0-9]\\+\\) has been allocated.*/\\1/p' " +
-                    "${slurmOutputFile} | tail -n1\""
+                    "\"sed -n " +
+                    "-e 's/^\\([0-9][0-9]*\\)$/\\1/p' " +
+                    "-e 's/.*Submitted batch job \\([0-9][0-9]*\\).*/\\1/p' " +
+                    "-e 's/.*srun: job \\([0-9][0-9]*\\) queued.*/\\1/p' " +
+                    "-e 's/.*srun: job \\([0-9][0-9]*\\) has been allocated.*/\\1/p' " +
+                    "${slurmOutputFile} | tail -n1\""
                 ),
                 returnStdout: true
             ).trim()

520-523: Replace shell sleep; guard cleanup on job ID.

-            Utils.exec(pipeline, script: "echo Sleeping to allow slurm job termination; sleep 30")
-
-            cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmJobID)
+            echo "Sleeping to allow slurm job termination..."
+            sleep(time: 30, unit: 'SECONDS')
+            if (slurmJobID?.trim()) {
+                cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmJobID)
+            } else {
+                echo "Skip SLURM cleanup: no job ID"
+            }

732-732: Quote sleep arg in K8s container command arrays (strings required).

-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_SLURM}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_SLURM}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_BUILD}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_BUILD}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_TEST}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_TEST}']

Also applies to: 750-750, 816-816


2256-2263: Retry mismatch: comment says 2, code retries 1. Align to intent.

-                            // Retry 2 times if timeout occurs.
+                            // Retry 2 times if timeout occurs.
@@
-                            trtllm_utils.llmRetry(1, "checkPipInstall", {
+                            trtllm_utils.llmRetry(2, "checkPipInstall", {
🧹 Nitpick comments (3)
jenkins/L0_Test.groovy (3)

272-273: Nit: avoid dumping script content unless needed.

If the setup script ever contains secrets/URLs, avoid logging it verbatim.


274-282: Prefer sbatch --parsable at source.

If possible, make SlurmConfig.generateCommand(...) use “sbatch --parsable” and capture the numeric ID directly to avoid brittle parsing here and downstream.


303-303: Use Jenkins sleep step instead of shell sleep.

-                Utils.exec(pipeline, script: "echo Sleeping to allow agent initialization; sleep 30")
+                echo "Sleeping to allow agent initialization..."
+                sleep(time: 30, unit: 'SECONDS')
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 0ad26bf and 7ec59b1.

📒 Files selected for processing (2)
  • jenkins/Build.groovy (2 hunks)
  • jenkins/L0_Test.groovy (17 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • jenkins/Build.groovy
🔇 Additional comments (9)
jenkins/L0_Test.groovy (9)

249-251: LGTM: initialize slurmJobID.


298-301: LGTM: validate job ID and fail early.


308-313: Verify wait window alignment (15h) with Slurm/K8s timeouts.

Ensure the 90×10min loop won’t exceed job/pod limits; consider making it env-configurable.


343-344: LGTM: explicit failure on node not online.


490-491: LGTM: run remote launch script via ssh.


495-496: LGTM: clear end-of-stage logging.


515-519: LGTM: failure path shows output then errors.


652-660: LGTM: workspace cleanup on no-result path.


682-690: LGTM: final workspace cleanup and explicit marker.

@chzblych chzblych force-pushed the yanchaol-slurm-output branch from 7ec59b1 to b3f6f64 Compare August 28, 2025 16:44
@chzblych
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16889 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16889 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12686 (Partly Tested) completed with status: 'FAILURE'

@chzblych
Collaborator Author

/bot skip --comment "Tested few cases including kill, abort, pass, etc."

@chzblych chzblych marked this pull request as ready for review August 28, 2025 18:09
@chzblych chzblych enabled auto-merge (squash) August 28, 2025 18:13
@tensorrt-cicd
Collaborator

PR_Github #16896 [ skip ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
jenkins/L0_Test.groovy (1)

118-121: Same secret exposure in uploadResults; switch to sshpass -e.

Inline passwords leak via process list/logs.

-            def downloadResultCmd = "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host}:${resultsFilePath} ${stageName}/"
+            def downloadResultCmd = "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host}:${resultsFilePath} ${stageName}/"
♻️ Duplicate comments (11)
jenkins/L0_Test.groovy (11)

734-734: Quote sleep argument in k8s command arrays (must be strings).

Unquoted values can render non-strings and break the pod spec.

-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_SLURM}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_SLURM}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_BUILD}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_BUILD}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_TEST}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_TEST}']

Also applies to: 752-752, 819-819


136-169: Guard SLURM cleanup when job ID is missing; avoid scancel/sacct with empty ID.

Skip remote cleanup if submission failed.

-        Utils.exec(
+        if (slurmJobID?.trim()) {
+        Utils.exec(
             pipeline,
             timeout: false,
             script: Utils.sshUserCmd(
                 remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
             )
-        )
+        )}

171-205: Same: guard single-node cleanup when no job ID.

-        Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID}")
+        Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID ?: 'N/A'}")
 
-        Utils.exec(
+        if (slurmJobID?.trim()) {
+        Utils.exec(
             pipeline,
             timeout: false,
             script: Utils.sshUserCmd(
                 remote,
                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
             )
-        )
+        )}

47-49: Use scope-specific pod timeout envs; avoid a single global override.

Prefer per-scope overrides with a global fallback.

-POD_TIMEOUT_SECONDS_TEST = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
-POD_TIMEOUT_SECONDS_BUILD = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
-POD_TIMEOUT_SECONDS_SLURM = env.podTimeoutSeconds ? env.podTimeoutSeconds : "79200"  // Use 22 hours to allow for 2 hour of buffer.
+POD_TIMEOUT_SECONDS_TEST  = (env.podTimeoutSecondsTest  ?: env.podTimeoutSeconds ?: "21600")
+POD_TIMEOUT_SECONDS_BUILD = (env.podTimeoutSecondsBuild ?: env.podTimeoutSeconds ?: "43200")
+POD_TIMEOUT_SECONDS_SLURM = (env.podTimeoutSecondsSlurm ?: env.podTimeoutSeconds ?: "79200")  // 22h with 2h buffer

284-297: Broaden Slurm ID parsing; support --parsable (plain numeric) output.

                 def jobIDs = slurmSubmitOutput
                     .readLines()
                     .collect { it.trim() }
                     .collectMany { line ->
                         def ids = []
+                        if (line ==~ /^\d+$/) ids << line
                         def m1 = (line =~ /Submitted batch job (\d+)/)
                         if (m1) ids << m1[0][1]  // Extract the first captured group
                         def m2 = (line =~ /srun: job (\d+) (queued|has been allocated)/)
                         if (m2) ids << m2[0][1]  // Extract the first captured group
                         return ids
                     }

347-352: Use Jenkins sleep step; guard cleanup when no job ID.

-            Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
+            sleep(time: 30, unit: 'SECONDS')
             CloudManager.destroyNode(nodeName)
-            Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
-            cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            sleep(time: 30, unit: 'SECONDS')
+            if (slurmJobID?.trim()) {
+                cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            } else {
+                echo "Skip SLURM cleanup: no job ID"
+            }

371-376: Sanitize BUILD_TAG-derived suffix (safe chars, max length); apply same to single-node path.

-    // Create a unique suffix for the job name
-    String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
+    // Create a unique, sanitized suffix for the job name (lowercase, [a-z0-9-], max 63)
+    def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+    String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+    if (customSuffix.length() > 63) customSuffix = customSuffix.substring(0, 63)
     def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"

Also mirror this sanitizer in the single-node block (Lines 243–248).
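A quick shell rendering of the same sanitizer shows what the replace-collapse-truncate steps produce; the `BUILD_TAG` value here is made up for illustration:

```shell
# Hypothetical BUILD_TAG; real values come from Jenkins.
BUILD_TAG='jenkins-LLM/Release Check-1234'
raw=$(printf '%s-%s' "$BUILD_TAG" "abc123" | tr '[:upper:]' '[:lower:]')
# Replace anything outside [a-z0-9-] with '-', collapse dash runs, cap at 63 chars.
suffix=$(printf '%s' "$raw" | sed -e 's/[^a-z0-9-]/-/g' -e 's/--*/-/g' | cut -c1-63)
echo "$suffix"
```

The `/` and the space in the tag each become a single dash, yielding `jenkins-llm-release-check-1234-abc123`, which is safe for node names and workspace paths.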


500-525: Verify ID before cleanup; broaden extraction; avoid running cleanup first.

        stage('Clean up SLURM Resources') {
            def slurmJobID = Utils.exec(
                 pipeline,
                 timeout: false,
                 script: Utils.sshUserCmd(
                     remote,
-                    "\"sed -n " +
-                    "-e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
-                    "-e 's/.*srun: job \\([0-9]\\+\\) queued.*/\\1/p' " +
-                    "-e 's/.*srun: job \\([0-9]\\+\\) has been allocated.*/\\1/p' " +
-                    "${slurmOutputFile} | tail -n1\""
+                    "\"sed -n " +
+                    "-e 's/.*Submitted batch job \\([0-9][0-9]*\\).*/\\1/p' " +
+                    "-e 's/.*srun: job \\([0-9][0-9]*\\) queued.*/\\1/p' " +
+                    "-e 's/.*srun: job \\([0-9][0-9]*\\) has been allocated.*/\\1/p' " +
+                    "-e 's/^\\([0-9][0-9]*\\)\$/\\1/p' " +
+                    "${slurmOutputFile} | tail -n1\""
                 ),
                 returnStdout: true
             ).trim()
 
-            Utils.exec(pipeline, script: "echo Sleeping to allow slurm job termination; sleep 30")
-
-            cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmJobID)
-
-            if (!slurmJobID || !slurmJobID.isNumber()) {
-                Utils.exec(pipeline, timeout: false, script: Utils.sshUserCmd(remote, "\"cat ${slurmOutputFile}\""))
-                echo "Slurm job did not submit successfully. No job ID found."
-            }
-
-            Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID}")
+            if (slurmJobID ==~ /^\d+$/) {
+                sleep(time: 30, unit: 'SECONDS')
+                cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmJobID)
+                echo "Slurm job ID: ${slurmJobID}"
+            } else {
+                Utils.exec(pipeline, timeout: false, script: Utils.sshUserCmd(remote, "\"cat ${slurmOutputFile}\""))
+                echo "Slurm job did not submit successfully. No job ID found."
+            }
         }
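The broadened sed extraction can be exercised locally against a fabricated `slurm_output.log`; the file name and job IDs below are illustrative:

```shell
# Sample output covering sbatch --parsable, srun queued/allocated, and sbatch.
cat > slurm_output.log <<'EOF'
24680
srun: job 12345 queued and waiting for resources
srun: job 12345 has been allocated resources
Submitted batch job 67890
EOF
# Same patterns as the suggested diff; tail -n1 keeps the most recent ID.
job_id=$(sed -n \
  -e 's/^\([0-9][0-9]*\)$/\1/p' \
  -e 's/.*Submitted batch job \([0-9][0-9]*\).*/\1/p' \
  -e 's/.*srun: job \([0-9][0-9]*\) queued.*/\1/p' \
  -e 's/.*srun: job \([0-9][0-9]*\) has been allocated.*/\1/p' \
  slurm_output.log | tail -n1)
echo "$job_id"
```

Putting the numeric-only rule first also avoids double-printing a line that a later substitution has already reduced to a bare number.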

423-425: Avoid exposing passwords on the process command line; use sshpass -e (env var).

-                Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)
+                Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)
-                Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)
+                Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)

Also applies to: 480-482


2258-2264: Retry count mismatch with comment; llmRetry(1) won’t “retry 2 times.”

Align code with intent.

                             // Retry 2 times if timeout occurs.
                             sh "env | sort"
-                            trtllm_utils.llmRetry(1, "checkPipInstall", {
+                            trtllm_utils.llmRetry(2, "checkPipInstall", {
                                 timeout(time: 30, unit: 'MINUTES') {
                                     checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
                                 }
                             })

If llmRetry’s parameter denotes “attempts” not “retries,” adjust accordingly. To verify:

#!/bin/bash
# Locate llmRetry definition and usage semantics
rg -nP 'def\s+llmRetry\s*\(|llmRetry\s*\(' --type groovy -C3
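Since llmRetry's internals aren't shown here, a generic sketch illustrates why "retries" vs. "attempts" changes the total run count; `retry_cmd` and `always_fail` are made-up names:

```shell
# Generic retry sketch: the first argument counts re-attempts AFTER the
# initial try, so `retry_cmd 2 ...` runs the command up to 3 times total.
retry_cmd() {
    retries=$1; shift
    n=0
    until "$@"; do
        n=$((n + 1))
        if [ "$n" -gt "$retries" ]; then return 1; fi
        echo "retry $n of $retries"
    done
}

runs=0
always_fail() { runs=$((runs + 1)); return 1; }
retry_cmd 2 always_fail || echo "gave up after $runs total runs"
```

Under these semantics a first argument of 1 gives only 2 total runs, which is the mismatch the comment points out.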

476-477: Preserve failure from srun through tee; quote file path.

-                    ${srunCmd} 2>&1 | tee ${slurmOutputFile}
+                    set -o pipefail
+                    ${srunCmd} 2>&1 | tee "${slurmOutputFile}"
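A small bash check makes the failure-masking concrete; `fake_srun` stands in for the real srun command:

```shell
# Without pipefail, $? after `cmd | tee` is tee's status, not cmd's.
status_without=$(bash -c '
  fake_srun() { echo "srun: error"; return 3; }
  fake_srun 2>&1 | tee out.log >/dev/null
  echo $?')
# With pipefail, the pipeline reports the rightmost nonzero status.
status_with=$(bash -c '
  set -o pipefail
  fake_srun() { echo "srun: error"; return 3; }
  fake_srun 2>&1 | tee out.log >/dev/null
  echo $?')
echo "without pipefail: $status_without"
echo "with pipefail:    $status_with"
```

Without pipefail the failure is reported as 0 (tee succeeded); with it, the srun exit code 3 survives the pipe.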
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b3f6f64 and d10cb8b.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (17 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@chzblych chzblych disabled auto-merge August 28, 2025 18:24
@tensorrt-cicd

PR_Github #16896 [ skip ] completed with state SUCCESS
Skipping testing for commit d10cb8b

@chzblych chzblych force-pushed the yanchaol-slurm-output branch from d10cb8b to d6a1e13 Compare August 28, 2025 18:45
Signed-off-by: Yanchao Lu <[email protected]>
@chzblych chzblych force-pushed the yanchaol-slurm-output branch from d6a1e13 to 814bf35 Compare August 28, 2025 18:51

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (14)
jenkins/L0_Test.groovy (14)

191-223: Guard single-node SLURM cleanup on missing/invalid slurmJobID.

Currently runs sacct/scontrol/scancel unconditionally.

-        Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID}")
-
-        Utils.exec(
-            pipeline,
-            script: Utils.sshUserCmd(
-                remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-            )
-        )
+        slurmJobID = slurmJobID?.trim()
+        Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID ?: 'N/A'}")
+        if (slurmJobID ==~ /^\d+$/) {
+            Utils.exec(
+                pipeline,
+                script: Utils.sshUserCmd(
+                    remote,
+                    "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                )
+            )
+        } else {
+            Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no valid job ID")
+        }

261-266: Sanitize BUILD_TAG-derived suffix for safe node names/workspaces.

Avoid illegal chars (e.g., “/”, spaces) and length issues.

-    String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
+    def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+    String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+    if (customSuffix.length() > 63) { customSuffix = customSuffix.substring(0, 63) }

302-315: Also accept sbatch --parsable (numeric-only) output in single-node parsing.

Adds robustness when sbatch is invoked with --parsable.

                 def jobIDs = slurmSubmitOutput
                     .readLines()
                     .collect { it.trim() }
                     .collectMany { line ->
                         def ids = []
+                        if (line ==~ /^\d+$/) ids << line
                         def m1 = (line =~ /Submitted batch job (\d+)/)
                         if (m1) ids << m1[0][1]  // Extract the first captured group
                         def m2 = (line =~ /srun: job (\d+) (queued|has been allocated)/)
                         if (m2) ids << m2[0][1]  // Extract the first captured group
                         return ids
                     }

47-49: Make timeouts scope-specific; avoid one env overriding all pods.

Let each pod type honor its own override, with a global fallback.

-POD_TIMEOUT_SECONDS_TEST = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
-POD_TIMEOUT_SECONDS_BUILD = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
-POD_TIMEOUT_SECONDS_SLURM = env.podTimeoutSeconds ? env.podTimeoutSeconds : "79200"  // Use 22 hours to allow for 2 hour of buffer.
+POD_TIMEOUT_SECONDS_TEST  = (env.podTimeoutSecondsTest  ?: env.podTimeoutSeconds ?: "21600")
+POD_TIMEOUT_SECONDS_BUILD = (env.podTimeoutSecondsBuild ?: env.podTimeoutSeconds ?: "43200")
+POD_TIMEOUT_SECONDS_SLURM = (env.podTimeoutSecondsSlurm ?: env.podTimeoutSeconds ?: "79200")  // 22h with 2h buffer

149-160: Broaden Slurm job-id extraction; handle --parsable and file-missing.

Support plain numeric output and guard when the log file is empty/missing.

-                    "\"sed -n " +
-                    "-e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
-                    "-e 's/.*srun: job \\([0-9]\\+\\) queued.*/\\1/p' " +
-                    "-e 's/.*srun: job \\([0-9]\\+\\) has been allocated.*/\\1/p' " +
-                    "${slurmOutputFile} | tail -n1\""
+                    "\"if [ -s '${slurmOutputFile}' ]; then " +
+                    "  sed -n " +
+                    "    -e 's/^\\([0-9]\\+\\)$/\\1/p' " +                                      // sbatch --parsable
+                    "    -e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
+                    "    -e 's/.*srun: job \\([0-9]\\+\\) .*/\\1/p' " +                         // queued/allocated
+                    "    '${slurmOutputFile}' | tail -n1; " +
+                    "fi\""

Also applies to: 153-157


162-165: Avoid Groovy String.isNumber(); use a strict numeric check.

Prevents false negatives across Groovy versions.

-        if (!slurmJobID || !slurmJobID.isNumber()) {
+        slurmJobID = slurmJobID?.trim()
+        if (!(slurmJobID ==~ /^\d+$/)) {
             Utils.exec(pipeline, script: Utils.sshUserCmd(remote, "\"cat ${slurmOutputFile}\""))
             error("Slurm job did not submit successfully. No job ID found.")
         }
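The same integer-only rule can be sketched in shell; note that Groovy's `String.isNumber()` also accepts decimals such as "1.5", which can never be a Slurm job ID, so a strict digit regex is the safer check:

```shell
# Strict numeric check, mirroring the Groovy `==~ /^\d+$/` suggestion.
is_numeric() { printf '%s' "$1" | grep -qE '^[0-9]+$'; }

for candidate in "12345" "12a4" "1.5" ""; do
    if is_numeric "$candidate"; then
        echo "valid:    '$candidate'"
    else
        echo "rejected: '$candidate'"
    fi
done
```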

288-289: Don’t expose passwords on command line; use sshpass -e or withEnv(SSHPASS=...).

Prevents secret leakage in process lists/logs.

Examples:

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${jenkinsSetupPath} ${remote.user}@${remote.host}:~/bloom/scripts/${nodeName}-slurm_jenkins_agent_setup.sh", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${jenkinsSetupPath} ${remote.user}@${remote.host}:~/bloom/scripts/${nodeName}-slurm_jenkins_agent_setup.sh", numRetries: 3,)

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${testListPath} ${remote.user}@${remote.host}:${testListPathNode}", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${testListPath} ${remote.user}@${remote.host}:${testListPathNode}", numRetries: 3,)

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)

To find any remaining occurrences:

#!/bin/bash
rg -nP "sshpass\s+-p\s+'?\$\{?remote\.passwd\}?" -C2 --type groovy

Also applies to: 431-432, 441-447, 455-456, 498-499
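The difference between the two forms is visible without running sshpass at all: the inline form embeds the secret in the argv string, the env form does not. Everything below (the password, hosts, paths) is hypothetical, and sshpass itself is never executed:

```shell
PASS='s3cret'  # hypothetical password, illustration only
# Inline form: the secret is part of the command line, visible in `ps`/logs.
inline_cmd="sshpass -p '${PASS}' scp -r file user@host:/tmp/"
# Env form: sshpass -e reads SSHPASS from the environment; argv stays clean.
env_cmd="sshpass -e scp -r file user@host:/tmp/"
case "$inline_cmd" in *"$PASS"*) inline_leaks=yes ;; *) inline_leaks=no ;; esac
case "$env_cmd"    in *"$PASS"*) env_leaks=yes    ;; *) env_leaks=no    ;; esac
echo "inline argv contains secret: $inline_leaks"
echo "env-var argv contains secret: $env_leaks"
```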


316-319: Same: replace isNumber() with a strict numeric match.

-                if (!slurmJobID || !slurmJobID.isNumber()) {
+                slurmJobID = slurmJobID?.trim()
+                if (!(slurmJobID ==~ /^\d+$/)) {
                     error("Slurm job did not submit successfully. No job ID found.\nSubmission output:\n${slurmSubmitOutput}")
                 }

365-370: Use Jenkins sleep and skip cleanup when no job ID.

Aligns with best practices and avoids no-op scancel.

        stage('Clean up SLURM Resources') {
-            Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
+            sleep(time: 30, unit: 'SECONDS')
             CloudManager.destroyNode(nodeName)
-            Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
-            cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            sleep(time: 30, unit: 'SECONDS')
+            if (slurmJobID?.trim()) {
+                cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            } else {
+                echo "Skip SLURM cleanup: no job ID"
+            }
         }

389-396: Sanitize jobUID/customSuffix derived from BUILD_TAG (multi-node too).

Mirror the single-node sanitizer to avoid illegal chars and overlong names.

-    // Create a unique suffix for the job name
-    String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
-    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
+    // Create a unique, sanitized suffix for the job name
+    def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+    String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+    if (customSuffix.length() > 63) { customSuffix = customSuffix.substring(0, 63) }
+    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"

494-495: Fix pipe exit handling and quote paths in tee.

Ensure srun failures propagate through the pipe.

-                    ${srunCmd} 2>&1 | tee ${slurmOutputFile}
+                    set -o pipefail
+                    ${srunCmd} 2>&1 | tee "${slurmOutputFile}"

729-729: Quote sleep argument in K8s pod command arrays.

Kubernetes expects array of strings; unquoted interpolation may render type mismatches.

-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_SLURM}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_SLURM}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_BUILD}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_BUILD}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_TEST}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_TEST}']

Also applies to: 748-748, 813-815


2253-2259: Comment/code mismatch: llmRetry(1) won’t “retry 2 times”.

Align intent or update the comment.

                             // Retry 2 times if timeout occurs.
 ...
-                            trtllm_utils.llmRetry(1, "checkPipInstall", {
+                            trtllm_utils.llmRetry(2, "checkPipInstall", {

If llmRetry’s parameter is “attempts” not “retries”, set to 3 accordingly.


169-177: Use Jenkins sleep and guard cleanup when no job ID.

Reduces flakiness and avoids running Slurm commands with empty IDs.

-        Utils.exec(pipeline, script: "echo Sleeping to allow slurm job termination; sleep 30")
+        sleep(time: 30, unit: 'SECONDS')
-        Utils.exec(
-            pipeline,
-            script: Utils.sshUserCmd(
-                remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-            )
-        )
+        if (slurmJobID?.trim() ==~ /^\d+$/) {
+            Utils.exec(
+                pipeline,
+                script: Utils.sshUserCmd(
+                    remote,
+                    "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                )
+            )
+        } else {
+            Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no valid job ID")
+        }
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

518-520: Guard multi-node cleanup if slurmOutputFile is missing.

Avoid remote sed on a null/empty path.

        stage('Clean up SLURM Resources') {
-            cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmOutputFile)
+            if (slurmOutputFile) {
+            if (slurmOutputFile) {
+                cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmOutputFile)
+            } else {
+                echo "Skip multi-node SLURM cleanup: no slurmOutputFile"
+            }
         }
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between d10cb8b and d6a1e13.

📒 Files selected for processing (2)
  • jenkins/Build.groovy (2 hunks)
  • jenkins/L0_Test.groovy (17 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • jenkins/Build.groovy
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)

649-657: LGTM: added workspace cleanup and clear end-of-stage logs.

Improves hygiene and traceability.

Also applies to: 679-687

@chzblych
Copy link
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (13)
jenkins/L0_Test.groovy (13)

169-170: Use Jenkins sleep step, not shell sleep.

Cleaner and shows up in Blue Ocean.

-        Utils.exec(pipeline, script: "echo Sleeping to allow slurm job termination; sleep 30")
+        sleep(time: 30, unit: 'SECONDS')

191-221: Same: guard single-node cleanup if job ID missing.

Avoid running Slurm commands with an empty ID.

-        Utils.exec(
-            pipeline,
-            script: Utils.sshUserCmd(
-                remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-            )
-        )
+        if (slurmJobID) {
+            Utils.exec(
+                pipeline,
+                script: Utils.sshUserCmd(
+                    remote,
+                    "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                )
+            )
+        } else {
+            Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no job ID")
+        }

47-49: Use scope-specific pod timeouts; avoid a single env overriding all.

Prefer per-scope overrides with fallback to the global to prevent accidental cross-scope overrides.

-POD_TIMEOUT_SECONDS_TEST = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
-POD_TIMEOUT_SECONDS_BUILD = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
-POD_TIMEOUT_SECONDS_SLURM = env.podTimeoutSeconds ? env.podTimeoutSeconds : "79200"  // Use 22 hours to allow for 2 hour of buffer.
+POD_TIMEOUT_SECONDS_TEST  = (env.podTimeoutSecondsTest  ?: env.podTimeoutSeconds ?: "21600")
+POD_TIMEOUT_SECONDS_BUILD = (env.podTimeoutSecondsBuild ?: env.podTimeoutSeconds ?: "43200")
+POD_TIMEOUT_SECONDS_SLURM = (env.podTimeoutSecondsSlurm ?: env.podTimeoutSeconds ?: "79200")  // 22h with 2h buffer

149-160: Broaden Slurm job ID extraction (support --parsable and other variants).

Handle plain numeric output and “queued and waiting for resources” in addition to existing patterns.

        def slurmJobID = Utils.exec(
             pipeline,
             script: Utils.sshUserCmd(
                 remote,
-                "\"sed -n " +
-                "-e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
-                "-e 's/.*srun: job \\([0-9]\\+\\) queued.*/\\1/p' " +
-                "-e 's/.*srun: job \\([0-9]\\+\\) has been allocated.*/\\1/p' " +
-                "${slurmOutputFile} | tail -n1\""
+                "\"sed -n " +
+                "-e 's/^\\([0-9][0-9]*\\)$/\\1/p' " +                                                  /* --parsable */
+                "-e 's/.*Submitted batch job \\([0-9][0-9]*\\).*/\\1/p' " +
+                "-e 's/.*srun: job \\([0-9][0-9]*\\) queued.*/\\1/p' " +
+                "-e 's/.*srun: job \\([0-9][0-9]*\\) has been allocated.*/\\1/p' " +
+                "-e 's/.*job \\([0-9][0-9]*\\) queued and waiting for resources.*/\\1/p' " +
+                "${slurmOutputFile} | tail -n1\""
             ),
             returnStdout: true
         ).trim()

162-165: Avoid Groovy String.isNumber(); use a numeric regex.

This is more robust across Groovy versions and avoids NPEs.

-        if (!slurmJobID || !slurmJobID.isNumber()) {
+        slurmJobID = slurmJobID?.trim()
+        if (!(slurmJobID ==~ /^\d+$/)) {
             Utils.exec(pipeline, script: Utils.sshUserCmd(remote, "\"cat ${slurmOutputFile}\""))
             error("Slurm job did not submit successfully. No job ID found.")
         }

171-177: Guard cleanup when no job ID.

Skip sacct/scontrol/scancel if submission failed.

-        Utils.exec(
-            pipeline,
-            script: Utils.sshUserCmd(
-                remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-            )
-        )
+        if (slurmJobID) {
+            Utils.exec(
+                pipeline,
+                script: Utils.sshUserCmd(
+                    remote,
+                    "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                )
+            )
+        } else {
+            Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no job ID")
+        }

302-315: Accept --parsable output during single-node submit parsing.

Add numeric-only line support.

                def jobIDs = slurmSubmitOutput
                     .readLines()
                     .collect { it.trim() }
                     .collectMany { line ->
                         def ids = []
+                        if (line ==~ /^\d+$/) ids << line
                         def m1 = (line =~ /Submitted batch job (\d+)/)
                         if (m1) ids << m1[0][1]  // Extract the first captured group
                         def m2 = (line =~ /srun: job (\d+) (queued|has been allocated)/)
                         if (m2) ids << m2[0][1]  // Extract the first captured group
                         return ids
                     }

316-319: Same: replace isNumber() with regex check.

-                if (!slurmJobID || !slurmJobID.isNumber()) {
+                slurmJobID = slurmJobID?.trim()
+                if (!(slurmJobID ==~ /^\d+$/)) {
                     error("Slurm job did not submit successfully. No job ID found.\nSubmission output:\n${slurmSubmitOutput}")
                 }

365-370: Use Jenkins sleep and guard cleanup.

Replace shell sleeps and skip cleanup if no job ID.

-        stage('Clean up SLURM Resources') {
-            Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
-            CloudManager.destroyNode(nodeName)
-            Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
-            cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
-        }
+        stage('Clean up SLURM Resources') {
+            sleep(time: 30, unit: 'SECONDS')
+            CloudManager.destroyNode(nodeName)
+            sleep(time: 30, unit: 'SECONDS')
+            if (slurmJobID) {
+                cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            } else {
+                echo "Skip SLURM cleanup: no job ID"
+            }
+        }

389-392: Sanitize BUILD_TAG-derived job UID (avoid illegal chars and length issues).

Prevents slashes/spaces from breaking paths or identifiers.

-    // Create a unique suffix for the job name
-    String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
-    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
+    // Create a unique, sanitized suffix for the job name (lowercase, [a-z0-9-], max 63)
+    def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+    String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+    if (customSuffix.length() > 63) { customSuffix = customSuffix.substring(0, 63) }
+    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"

494-494: Add pipefail and quote the output path in the tee pipeline.

Ensures srun failures propagate through tee and paths with spaces are safe.

-                    ${srunCmd} 2>&1 | tee ${slurmOutputFile}
+                    set -o pipefail
+                    ${srunCmd} 2>&1 | tee "${slurmOutputFile}"

729-729: Quote sleep args in K8s container commands (array of strings).

K8s expects strings; unquoted numbers can misrender.

-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_SLURM}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_SLURM}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_BUILD}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_BUILD}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_TEST}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_TEST}']

Also applies to: 747-747, 813-813


2253-2259: Align retry count with comment; llmRetry(1) likely doesn’t retry twice.

Make behavior match “Retry 2 times”.

                             // Retry 2 times if timeout occurs.
...
-                            trtllm_utils.llmRetry(1, "checkPipInstall", {
+                            trtllm_utils.llmRetry(2, "checkPipInstall", {
                                 timeout(time: 30, unit: 'MINUTES') {
                                     checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
                                 }
                             })
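`llmRetry` lives in the shared `trtllm_utils` library, so its exact semantics aren't visible here; a minimal sketch of the assumed contract (first argument = number of retries after the initial attempt, hence `llmRetry(2)` for "retry 2 times") looks like:

```shell
# Generic retry helper: run a command up to (1 + retries) times.
retry() {
  retries=$1; shift
  attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -gt "$retries" ]; then
      return 1
    fi
    echo "attempt $attempt failed; retrying" >&2
  done
  return 0
}

# Stand-in command that fails twice, then succeeds on the third call.
calls=0
flaky() {
  calls=$((calls + 1))
  [ "$calls" -ge 3 ]
}
retry 2 flaky
result=$?
```

Under this reading, `retry 2` tolerates exactly two failures, which is what the "Retry 2 times" comment promises and what `llmRetry(1)` would not deliver.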
🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)

324-331: Avoid magic numbers in online-wait; derive attempts or make configurable.

Compute max checks from an env var or derive from duration/sleep interval.

-            while (!CloudManager.isNodeOnline(nodeName) && counter < 90) {
-                // Wait 10 minutes to check status of the node again
-                sleep(time: 10, unit: 'MINUTES')
+            def checkIntervalMin = 10
+            def maxWaitMin = (env.SLURM_ONLINE_WAIT_MINUTES ?: "900") as Integer  // default 15h
+            def maxChecks = (int)(maxWaitMin / checkIntervalMin)
+            while (!CloudManager.isNodeOnline(nodeName) && counter < maxChecks) {
+                sleep(time: checkIntervalMin, unit: 'MINUTES')
                 counter++
             }
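Deriving the attempt count instead of hard-coding 90 is easy to check in isolation (`SLURM_ONLINE_WAIT_MINUTES` is the proposed override, hypothetical here):

```shell
unset SLURM_ONLINE_WAIT_MINUTES  # ensure the 15h default applies in this sketch
check_interval_min=10
max_wait_min="${SLURM_ONLINE_WAIT_MINUTES:-900}"
max_checks=$((max_wait_min / check_interval_min))
echo "max_checks=$max_checks"
```

With the default budget of 900 minutes and a 10-minute poll interval, this reproduces the current 90 checks while making both knobs explicit.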

649-657: Prefer deleteDir() over shell rm -rf for workspace cleanup.

deleteDir() is portable and reliable in Jenkins workspaces.

-                // Clean up the workspace
-                sh """
-                    env | sort
-                    pwd && ls -alh
-                    rm -rf ./*
-                """
+                // Clean up the workspace
+                sh """
+                    env | sort
+                    pwd && ls -alh
+                """
+                deleteDir()
-        // Clean up the workspace
-        sh """
-            env | sort
-            pwd && ls -alh
-            rm -rf ./*
-        """
+        // Clean up the workspace
+        sh """
+            env | sort
+            pwd && ls -alh
+        """
+        deleteDir()

Also applies to: 679-687

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between d6a1e13 and 814bf35.

📒 Files selected for processing (2)
  • jenkins/Build.groovy (2 hunks)
  • jenkins/L0_Test.groovy (17 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • jenkins/Build.groovy
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)

513-513: LGTM: helpful completion log.


292-301: Verify SlurmConfig.generateCommand signature and Jenkins.instance.rootUrl access.
SlurmConfig comes from the shared library – confirm its generateCommand method accepts (SlurmCluster, SlurmPartition, nodeSecret, nodeName, rootUrl). Also ensure Jenkins.instance.rootUrl is permitted in the pipeline sandbox (import jenkins.model.Jenkins or switch to env.JENKINS_URL if needed).

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16900 [ run ] triggered by Bot

@chzblych
Copy link
Collaborator Author

/bot skip --comment "Tested a few cases including kill, abort, pass, etc."

@chzblych chzblych enabled auto-merge (squash) August 28, 2025 20:03
@tensorrt-cicd
Copy link
Collaborator

PR_Github #16904 [ skip ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16900 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16904 [ skip ] completed with state SUCCESS
Skipping testing for commit 814bf35

@chzblych chzblych merged commit 460a34c into NVIDIA:main Aug 28, 2025
7 checks passed
@chzblych chzblych deleted the yanchaol-slurm-output branch August 28, 2025 20:19