
Conversation

@chzblych chzblych commented Aug 25, 2025

Summary by CodeRabbit

  • Chores

    • Split CI pod timeouts into TEST/BUILD/SLURM (SLURM extended to ~22h); unified build timeout.
    • Per-run multi-node isolation with unique run IDs, per-run scripts/logs, remote capture, and coordinated teardown.
    • Added retry for package installs, richer environment dumps, explicit workspace purges, and clearer "Finished test stage execution" logs.
  • Bug Fixes

    • Improved Slurm handling: capture/validate job IDs, extended online-wait (~15h), and Slurm-aware cleanup for single- and multi-node runs.

@chzblych chzblych requested review from a team as code owners August 25, 2025 05:12
@chzblych chzblych requested review from tburt-nv and kxdc August 25, 2025 05:12
Contributor

coderabbitai bot commented Aug 25, 2025

📝 Walkthrough

Walkthrough

Refactors pod timeout constants into test/build/slurm scopes; introduces per-run Slurm jobUID, per-run scripts/logs, and server-side extraction/validation/propagation of Slurm job IDs; extends online-wait window; updates cleanup signatures to accept Slurm IDs and run Slurm-specific cleanup; improves logging, workspace hygiene, and pip-install retry.
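The extraction/validation step described above can be exercised as a standalone shell filter — a minimal sketch mirroring the sed patterns used elsewhere in this PR; the function name and log filename are illustrative, not the pipeline's actual code:

```shell
# Pull the last Slurm job ID out of submission output, covering both
# sbatch ("Submitted batch job <id>") and srun
# ("srun: job <id> queued" / "has been allocated") phrasings.
extract_slurm_job_id() {
  sed -n \
    -e 's/.*Submitted batch job \([0-9][0-9]*\).*/\1/p' \
    -e 's/.*srun: job \([0-9][0-9]*\) queued.*/\1/p' \
    -e 's/.*srun: job \([0-9][0-9]*\) has been allocated.*/\1/p' \
    "$1" | tail -n1
}
```

Taking the last match means a resubmission later in the log wins over an earlier attempt, which matches the `tail -n1` behavior in the pipeline's own sed command.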

Changes

Cohort / File(s) Summary of Changes
L0 pipeline (timeouts, Slurm job handling, cleanup, logging)
jenkins/L0_Test.groovy
Replaced POD_TIMEOUT_SECONDS/POD_TIMEOUT_SECONDS_TMP with POD_TIMEOUT_SECONDS_TEST, POD_TIMEOUT_SECONDS_BUILD, POD_TIMEOUT_SECONDS_SLURM; assign timeouts per-container in pod config; add per-run jobUID, per-run slurm_run.sh/slurm_launch.sh and slurm_output.log; upload/cat remote scripts for visibility; parse/validate slurmJobID from submission output (single- and multi-node); extend online-wait loop (12→90 iterations); update cleanup signatures to accept slurmJobID and run scancel/sacct/scontrol; propagate per-run Slurm IDs to final cleanup; add env/pwd/ls dumps, workspace purge steps, delays, and explicit logs; wrap pip-install checks with llmRetry.
Build pod timeout constant
jenkins/Build.groovy
Remove POD_TIMEOUT_SECONDS and POD_TIMEOUT_SECONDS_TMP; add POD_TIMEOUT_SECONDS_BUILD (default 43200); use POD_TIMEOUT_SECONDS_BUILD for build pod sleep/timeout.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant J as Jenkins pipeline
  participant N as Remote node
  participant S as Slurm scheduler
  participant K as Kubernetes

  rect rgb(250,250,255)
    note right of J: prepare per-run jobUID\ncreate per-run workspace & scripts\nlog env/pwd/ls
  end

  J->>N: upload per-run scripts (slurm_run.sh, slurm_launch.sh)
  J->>N: cat uploaded scripts (debug visibility)
  J->>N: execute submission script -> tee per-run slurm_output.log
  N->>S: sbatch / srun submission
  S-->>N: submission output (contains job ID)
  N-->>J: slurm_output.log (captured)
  J->>J: parse slurm_output.log -> extract & validate slurmJobID
  alt Node online within window
    J->>K: run tests (containers use POD_TIMEOUT_SECONDS_TEST / BUILD / SLURM)
  else Timeout / failure
    J->>J: fail pipeline with explicit error and trigger cleanup
  end
  rect rgb(245,255,245)
    note right of J: Finalization — invoke cleanup with slurmJobID\nrun scancel/sacct/scontrol and remove node resources\npurge workspace & logs
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • kxdc
  • tburt-nv

@chzblych chzblych marked this pull request as draft August 25, 2025 05:12
@chzblych
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

116-118: Confirm Slurm results path compatibility with the new shared-lib

This scp targets /home/svc_tensorrt/bloom/scripts/${nodeName}/results/results.xml. If the dev branch changed where the agent writes results, this silently degrades to “No results xml to submit” and we lose JUnit data. Recommend confirming the path contract; optionally allow an env override for fast iteration.

Minimal, backward-compatible tweak to enable an override:

-            def resultsFilePath = "/home/svc_tensorrt/bloom/scripts/${nodeName}/results/results.xml"
-            def downloadResultCmd = "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host}:${resultsFilePath} ${stageName}/"
+            def defaultResultsFile = "/home/svc_tensorrt/bloom/scripts/${nodeName}/results/results.xml"
+            def resultsFilePath = env.SLURM_RESULTS_FILE ?: defaultResultsFile
+            def downloadResultCmd = "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host}:${resultsFilePath} ${stageName}/"

Quick verification during a trial run: check both old/new locations before scp to reduce false negatives.
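A hedged sketch of that pre-scp probe — the helper name is an assumption for illustration, and the candidate paths would come from the node layout rather than being hard-coded:

```shell
# Return the first existing results file from a list of candidate
# locations, so a relocated path produces an explicit diagnostic
# rather than a silent "No results xml to submit".
find_results_file() {
  for p in "$@"; do
    if [ -f "$p" ]; then
      printf '%s\n' "$p"
      return 0
    fi
  done
  echo "No results xml found in candidate locations: $*" >&2
  return 1
}
```

Calling it with the old location first keeps current behavior while surfacing a relocation in the log.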

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f61b74f and eaa6be3.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)

242-247: Verify library resource availability: slurm_jenkins_agent_setup.sh

This relies on Utils.copyLibraryResource to fetch slurm_jenkins_agent_setup.sh from the shared lib. Ensure the dev branch keeps the same resource path/name; otherwise this step fails early in “Request Node via SLURM.”

Suggested dry-run check during the first job on the new branch:

  • Log the resolved library resource path before chmod.
  • If failure, print “library resource not found in bloom-jenkins-shared-lib@dev-yanchaol-slurm-output” to speed diagnosis.

@tensorrt-cicd
Collaborator

PR_Github #16379 [ run ] triggered by Bot

@chzblych
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16383 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16379 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #16383 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12311 (Partly Tested) completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@chzblych
Collaborator Author

/bot run --stage-list "GH200-PackageSanityCheck-PY312-DLFW, DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

@chzblych chzblych changed the title [None][test] - Test Slurm output file and path [None][chore] Some improvements for CI stability Aug 25, 2025
@chzblych chzblych force-pushed the yanchaol-slurm-output branch from d0980b5 to 9b4e355 Compare August 25, 2025 15:04
@chzblych
Collaborator Author

/bot run --stage-list "GH200-PackageSanityCheck-PY312-DLFW, DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16456 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16457 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16456 [ run ] completed with state ABORTED

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)

1-1: Resolved: shared-lib pinned to main (reproducible builds).

You reverted to using bloom-jenkins-shared-lib@main. This addresses the prior concern about pinning to a feature branch and restores reproducibility.

🧹 Nitpick comments (6)
jenkins/L0_Test.groovy (6)

248-249: Gate full script dumps behind a debug flag to avoid leaking internals to the console.

Dumping the entire setup script into the build log increases noise and risks exposing internal paths/infra details. Use the existing DETAILED_LOG flag to control this.

Apply this diff:

-                Utils.exec(pipeline, script: "cat ${jenkinsSetupPath}")
+                if (testFilter[(DETAILED_LOG)]) {
+                    Utils.exec(pipeline, script: "cat ${jenkinsSetupPath}")
+                }

300-304: Replace fixed sleeps with deterministic destroy+wait logic.

Fixed 30s sleeps add latency without guarantees. Prefer a destroy + bounded wait for node offline to make cleanup predictable and faster when possible.

Apply this diff:

-        Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
-        CloudManager.destroyNode(nodeName)
-        Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
-        cleanUpNodeResources(pipeline, cluster, nodeName)
+        timeout(time: 10, unit: 'MINUTES') {
+            // Try destroy, then verify the node goes offline; retry a few times if needed.
+            retry(3) {
+                CloudManager.destroyNode(nodeName)
+                sleep(time: 10, unit: 'SECONDS')
+                if (CloudManager.isNodeOnline(nodeName)) {
+                    error "Node ${nodeName} still online after destroy attempt; retrying..."
+                }
+            }
+        }
+        cleanUpNodeResources(pipeline, cluster, nodeName)

326-326: Minor: gate noisy env/pwd/ls dump behind DETAILED_LOG.

Great for debugging; consider guarding with DETAILED_LOG to keep default logs tight.

Example change:

-    Utils.exec(pipeline, script: "env | sort && pwd && ls -alh")
+    if (testFilter[(DETAILED_LOG)]) {
+        Utils.exec(pipeline, script: "env | sort && pwd && ls -alh")
+    }

370-370: Gate slurm_run.sh content dump behind DETAILED_LOG.

Same rationale as above; reduces log noise and potential info exposure.

Apply this diff:

-                Utils.exec(pipeline, script: "cat ${scriptRunLocalPath}")
+                if (testFilter[(DETAILED_LOG)]) {
+                    Utils.exec(pipeline, script: "cat ${scriptRunLocalPath}")
+                }

428-428: Gate slurm_launch.sh content dump behind DETAILED_LOG.

The generated script includes environment and paths; dump it only when requested.

Apply this diff:

-                Utils.exec(pipeline, script: "cat ${scriptLaunchDestPath}")
+                if (testFilter[(DETAILED_LOG)]) {
+                    Utils.exec(pipeline, script: "cat ${scriptLaunchDestPath}")
+                }

322-325: Harden jobUID/suffix for cross-system safety
BUILD_TAG may include characters that aren’t valid in Slurm job names, SSH identifiers, or filesystem paths. To prevent downstream failures, sanitize the suffix to allow only lowercase letters, digits, hyphens (and underscores if desired), then cap the entire jobUID at 63 characters (the common max for many schedulers and filesystems).

Please update lines 322–325 in jenkins/L0_Test.groovy as follows:

-    // Create a unique suffix for the job name
-    String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
-    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
+    // Create a unique, sanitized suffix for the job name
+    def suffixRaw = ("${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}").toLowerCase()
+    String customSuffix = suffixRaw.replaceAll('[^a-z0-9-]', '')
+    // Build jobUID and cap length for Slurm/SSH/filesystem compatibility
+    def jobUID = ("${cluster.host}-multi_node_test-${customSuffix}")
+        .replaceAll('[^a-z0-9-_]', '')
+        .take(63)

Optionally, add a runtime check to flag any violations:

#!/bin/bash
# Ensure jobUID only contains [a-z0-9_-] and is ≤63 chars
[[ "$jobUID" =~ ^[a-z0-9_-]{1,63}$ ]] || { echo "Invalid jobUID: $jobUID" >&2; exit 1; }
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between d0980b5 and 9b4e355.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (6 hunks)

@chzblych chzblych force-pushed the yanchaol-slurm-output branch 2 times, most recently from e57f754 to f89dfd7 Compare August 25, 2025 15:18
@chzblych
Collaborator Author

/bot run --stage-list "GH200-PackageSanityCheck-PY312-DLFW, DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16458 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16457 [ run ] completed with state ABORTED
/LLM/main/L0_MergeRequest_PR pipeline #12364 (Partly Tested) completed with status: 'ABORTED'

@tensorrt-cicd
Collaborator

PR_Github #16458 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12365 (Partly Tested) completed with status: 'SUCCESS'

@chzblych
Collaborator Author

/bot run --stage-list "GH200-PackageSanityCheck-PY312-DLFW" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16503 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16867 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16867 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12665 (Partly Tested) completed with status: 'FAILURE'

@chzblych chzblych force-pushed the yanchaol-slurm-output branch from 0ad26bf to 7ec59b1 Compare August 28, 2025 16:21
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (13)
jenkins/L0_Test.groovy (13)

47-49: Make pod timeouts scope-specific; don’t let one env var override all three.

Use per-scope overrides with fallback to global, as previously suggested.

-POD_TIMEOUT_SECONDS_TEST = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
-POD_TIMEOUT_SECONDS_BUILD = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
-POD_TIMEOUT_SECONDS_SLURM = env.podTimeoutSeconds ? env.podTimeoutSeconds : "79200"  // Use 22 hours to allow for 2 hour of buffer.
+POD_TIMEOUT_SECONDS_TEST  = (env.podTimeoutSecondsTest  ?: env.podTimeoutSeconds ?: "21600")
+POD_TIMEOUT_SECONDS_BUILD = (env.podTimeoutSecondsBuild ?: env.podTimeoutSeconds ?: "43200")
+POD_TIMEOUT_SECONDS_SLURM = (env.podTimeoutSecondsSlurm ?: env.podTimeoutSeconds ?: "79200")  // 22h with 2h buffer

137-169: Guard SLURM cleanup when no job ID; avoid unconditional scancel/sacct.

Skip cleanup if submission failed. This was flagged earlier and still applies.

-        Utils.exec(
-            pipeline,
-            timeout: false,
-            script: Utils.sshUserCmd(
-                remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-            )
-        )
+        if (slurmJobID?.trim()) {
+            Utils.exec(
+                pipeline,
+                timeout: false,
+                script: Utils.sshUserCmd(
+                    remote,
+                    "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                )
+            )
+        } else {
+            Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no job ID")
+        }

171-205: Same issue in single-node cleanup: guard on missing slurmJobID.

-        Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID}")
+        Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID ?: 'N/A'}")
@@
-        Utils.exec(
-            pipeline,
-            timeout: false,
-            script: Utils.sshUserCmd(
-                remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-            )
-        )
+        if (slurmJobID?.trim()) {
+            Utils.exec(
+                pipeline,
+                timeout: false,
+                script: Utils.sshUserCmd(
+                    remote,
+                    "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                )
+            )
+        } else {
+            Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no job ID")
+        }

284-297: Broaden job-id parsing to include --parsable numeric output.

                 def jobIDs = slurmSubmitOutput
                     .readLines()
                     .collect { it.trim() }
                     .collectMany { line ->
                         def ids = []
+                        if (line ==~ /^\d+$/) ids << line
                         def m1 = (line =~ /Submitted batch job (\d+)/)
                         if (m1) ids << m1[0][1]  // Extract the first captured group
                         def m2 = (line =~ /srun: job (\d+) (queued|has been allocated)/)
                         if (m2) ids << m2[0][1]  // Extract the first captured group
                         return ids
                     }

347-353: Cleanup: replace shell sleeps; guard cleanup on job ID.

-            Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
+            echo "Sleeping to allow docker stop..."
+            sleep(time: 30, unit: 'SECONDS')
             CloudManager.destroyNode(nodeName)
-            Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
-            cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            echo "Sleeping to allow node destruction..."
+            sleep(time: 30, unit: 'SECONDS')
+            if (slurmJobID?.trim()) {
+                cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            } else {
+                echo "Skip SLURM cleanup: no job ID"
+            }

371-377: Sanitize BUILD_TAG-derived suffix for jobUID; enforce [a-z0-9-] and length.

Prevents illegal characters in job names/paths. Apply similar sanitizer to single-node (Lines 243–247).

-    // Create a unique suffix for the job name
-    String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
-    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
+    // Create a unique, sanitized suffix for the job name (lowercase, [a-z0-9-], max 63)
+    def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+    String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+    customSuffix = customSuffix.length() > 63 ? customSuffix.substring(0, 63) : customSuffix
+    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"

423-425: Avoid exposing passwords on process command line (scp upload of run script).

-                Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)
+                Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)

476-477: Harden pipe: enable pipefail and quote output path.

-                    ${srunCmd} 2>&1 | tee ${slurmOutputFile}
+                    set -o pipefail
+                    ${srunCmd} 2>&1 | tee "${slurmOutputFile}"

480-482: Avoid exposing passwords on process command line (scp upload of launch script).

-                Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)
+                Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)

501-513: Broaden multi-node job-id extraction; include numeric lines.

-            def slurmJobID = Utils.exec(
+            def slurmJobID = Utils.exec(
                 pipeline,
                 timeout: false,
                 script: Utils.sshUserCmd(
                     remote,
-                    "\"sed -n " +
-                    "-e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
-                    "-e 's/.*srun: job \\([0-9]\\+\\) queued.*/\\1/p' " +
-                    "-e 's/.*srun: job \\([0-9]\\+\\) has been allocated.*/\\1/p' " +
-                    "${slurmOutputFile} | tail -n1\""
+                    "\"sed -n " +
+                    "-e 's/^\\([0-9][0-9]*\\)$/\\1/p' " +
+                    "-e 's/.*Submitted batch job \\([0-9][0-9]*\\).*/\\1/p' " +
+                    "-e 's/.*srun: job \\([0-9][0-9]*\\) queued.*/\\1/p' " +
+                    "-e 's/.*srun: job \\([0-9][0-9]*\\) has been allocated.*/\\1/p' " +
+                    "${slurmOutputFile} | tail -n1\""
                 ),
                 returnStdout: true
             ).trim()

520-523: Replace shell sleep; guard cleanup on job ID.

-            Utils.exec(pipeline, script: "echo Sleeping to allow slurm job termination; sleep 30")
-
-            cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmJobID)
+            echo "Sleeping to allow slurm job termination..."
+            sleep(time: 30, unit: 'SECONDS')
+            if (slurmJobID?.trim()) {
+                cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmJobID)
+            } else {
+                echo "Skip SLURM cleanup: no job ID"
+            }

732-732: Quote sleep arg in K8s container command arrays (strings required).

-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_SLURM}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_SLURM}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_BUILD}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_BUILD}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_TEST}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_TEST}']

Also applies to: 750-750, 816-816


2256-2263: Retry mismatch: comment says 2, code retries 1. Align to intent.

-                            // Retry 2 times if timeout occurs.
+                            // Retry 2 times if timeout occurs.
@@
-                            trtllm_utils.llmRetry(1, "checkPipInstall", {
+                            trtllm_utils.llmRetry(2, "checkPipInstall", {
🧹 Nitpick comments (3)
jenkins/L0_Test.groovy (3)

272-273: Nit: avoid dumping script content unless needed.

If the setup script ever contains secrets/URLs, avoid logging it verbatim.


274-282: Prefer sbatch --parsable at source.

If possible, make SlurmConfig.generateCommand(...) use “sbatch --parsable” and capture the numeric ID directly to avoid brittle parsing here and downstream.


303-303: Use Jenkins sleep step instead of shell sleep.

-                Utils.exec(pipeline, script: "echo Sleeping to allow agent initialization; sleep 30")
+                echo "Sleeping to allow agent initialization..."
+                sleep(time: 30, unit: 'SECONDS')
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 0ad26bf and 7ec59b1.

📒 Files selected for processing (2)
  • jenkins/Build.groovy (2 hunks)
  • jenkins/L0_Test.groovy (17 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • jenkins/Build.groovy
🔇 Additional comments (9)
jenkins/L0_Test.groovy (9)

249-251: LGTM: initialize slurmJobID.


298-301: LGTM: validate job ID and fail early.


308-313: Verify wait window alignment (15h) with Slurm/K8s timeouts.

Ensure the 90×10min loop won’t exceed job/pod limits; consider making it env-configurable.


343-344: LGTM: explicit failure on node not online.


490-491: LGTM: run remote launch script via ssh.


495-496: LGTM: clear end-of-stage logging.


515-519: LGTM: failure path shows output then errors.


652-660: LGTM: workspace cleanup on no-result path.


682-690: LGTM: final workspace cleanup and explicit marker.

@chzblych chzblych force-pushed the yanchaol-slurm-output branch from 7ec59b1 to b3f6f64 Compare August 28, 2025 16:44
@chzblych
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16889 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16889 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12686 (Partly Tested) completed with status: 'FAILURE'

@chzblych
Collaborator Author

/bot skip --comment "Tested few cases including kill, abort, pass, etc."

@chzblych chzblych marked this pull request as ready for review August 28, 2025 18:09
@chzblych chzblych enabled auto-merge (squash) August 28, 2025 18:13
@tensorrt-cicd
Collaborator

PR_Github #16896 [ skip ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
jenkins/L0_Test.groovy (1)

118-121: Same secret exposure in uploadResults; switch to sshpass -e.

Inline passwords leak via process list/logs.

-            def downloadResultCmd = "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host}:${resultsFilePath} ${stageName}/"
+            def downloadResultCmd = "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host}:${resultsFilePath} ${stageName}/"
♻️ Duplicate comments (11)
jenkins/L0_Test.groovy (11)

734-734: Quote sleep argument in k8s command arrays (must be strings).

Unquoted values can render non-strings and break the pod spec.

-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_SLURM}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_SLURM}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_BUILD}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_BUILD}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_TEST}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_TEST}']

Also applies to: 752-752, 819-819


136-169: Guard SLURM cleanup when job ID is missing; avoid scancel/sacct with empty ID.

Skip remote cleanup if submission failed.

-        Utils.exec(
+        if (slurmJobID?.trim()) {
+        Utils.exec(
             pipeline,
             timeout: false,
             script: Utils.sshUserCmd(
                 remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
             )
-        )
+        )}

171-205: Same: guard single-node cleanup when no job ID.

-        Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID}")
+        Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID ?: 'N/A'}")
 
-        Utils.exec(
+        if (slurmJobID?.trim()) {
+        Utils.exec(
             pipeline,
             timeout: false,
             script: Utils.sshUserCmd(
                 remote,
                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
             )
-        )
+        )}

47-49: Use scope-specific pod timeout envs; avoid a single global override.

Prefer per-scope overrides with a global fallback.

-POD_TIMEOUT_SECONDS_TEST = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
-POD_TIMEOUT_SECONDS_BUILD = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
-POD_TIMEOUT_SECONDS_SLURM = env.podTimeoutSeconds ? env.podTimeoutSeconds : "79200"  // Use 22 hours to allow for 2 hour of buffer.
+POD_TIMEOUT_SECONDS_TEST  = (env.podTimeoutSecondsTest  ?: env.podTimeoutSeconds ?: "21600")
+POD_TIMEOUT_SECONDS_BUILD = (env.podTimeoutSecondsBuild ?: env.podTimeoutSeconds ?: "43200")
+POD_TIMEOUT_SECONDS_SLURM = (env.podTimeoutSecondsSlurm ?: env.podTimeoutSeconds ?: "79200")  // 22h with 2h buffer

284-297: Broaden Slurm ID parsing; support --parsable (plain numeric) output.

                 def jobIDs = slurmSubmitOutput
                     .readLines()
                     .collect { it.trim() }
                     .collectMany { line ->
                         def ids = []
+                        if (line ==~ /^\d+$/) ids << line
                         def m1 = (line =~ /Submitted batch job (\d+)/)
                         if (m1) ids << m1[0][1]  // Extract the first captured group
                         def m2 = (line =~ /srun: job (\d+) (queued|has been allocated)/)
                         if (m2) ids << m2[0][1]  // Extract the first captured group
                         return ids
                     }

347-352: Use Jenkins sleep step; guard cleanup when no job ID.

-            Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
+            sleep(time: 30, unit: 'SECONDS')
             CloudManager.destroyNode(nodeName)
-            Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
-            cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            sleep(time: 30, unit: 'SECONDS')
+            if (slurmJobID?.trim()) {
+                cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            } else {
+                echo "Skip SLURM cleanup: no job ID"
+            }

371-376: Sanitize BUILD_TAG-derived suffix (safe chars, max length); apply same to single-node path.

-    // Create a unique suffix for the job name
-    String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
+    // Create a unique, sanitized suffix for the job name (lowercase, [a-z0-9-], max 63)
+    def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+    String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+    if (customSuffix.length() > 63) customSuffix = customSuffix.substring(0, 63)
     def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"

Also mirror this sanitizer in the single-node block (Lines 243–248).
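A quick shell rendering of the same sanitizer shows what the replace-collapse-truncate steps produce; the `BUILD_TAG` value here is made up for illustration:

```shell
# Hypothetical BUILD_TAG; real values come from Jenkins.
BUILD_TAG='jenkins-LLM/Release Check-1234'
raw=$(printf '%s-%s' "$BUILD_TAG" "abc123" | tr '[:upper:]' '[:lower:]')
# Replace anything outside [a-z0-9-] with '-', collapse dash runs, cap at 63 chars.
suffix=$(printf '%s' "$raw" | sed -e 's/[^a-z0-9-]/-/g' -e 's/--*/-/g' | cut -c1-63)
echo "$suffix"
```

The `/` and the space in the tag each become a single dash, yielding `jenkins-llm-release-check-1234-abc123`, which is safe for node names and workspace paths.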


500-525: Verify ID before cleanup; broaden extraction; avoid running cleanup first.

        stage('Clean up SLURM Resources') {
            def slurmJobID = Utils.exec(
                 pipeline,
                 timeout: false,
                 script: Utils.sshUserCmd(
                     remote,
-                    "\"sed -n " +
-                    "-e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
-                    "-e 's/.*srun: job \\([0-9]\\+\\) queued.*/\\1/p' " +
-                    "-e 's/.*srun: job \\([0-9]\\+\\) has been allocated.*/\\1/p' " +
-                    "${slurmOutputFile} | tail -n1\""
+                    "\"sed -n " +
+                    "-e 's/.*Submitted batch job \\([0-9][0-9]*\\).*/\\1/p' " +
+                    "-e 's/.*srun: job \\([0-9][0-9]*\\) queued.*/\\1/p' " +
+                    "-e 's/.*srun: job \\([0-9][0-9]*\\) has been allocated.*/\\1/p' " +
+                    "-e 's/^\\([0-9][0-9]*\\)\$/\\1/p' " +
+                    "${slurmOutputFile} | tail -n1\""
                 ),
                 returnStdout: true
             ).trim()
 
-            Utils.exec(pipeline, script: "echo Sleeping to allow slurm job termination; sleep 30")
-
-            cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmJobID)
-
-            if (!slurmJobID || !slurmJobID.isNumber()) {
-                Utils.exec(pipeline, timeout: false, script: Utils.sshUserCmd(remote, "\"cat ${slurmOutputFile}\""))
-                echo "Slurm job did not submit successfully. No job ID found."
-            }
-
-            Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID}")
+            if (slurmJobID ==~ /^\d+$/) {
+                sleep(time: 30, unit: 'SECONDS')
+                cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmJobID)
+                echo "Slurm job ID: ${slurmJobID}"
+            } else {
+                Utils.exec(pipeline, timeout: false, script: Utils.sshUserCmd(remote, "\"cat ${slurmOutputFile}\""))
+                echo "Slurm job did not submit successfully. No job ID found."
+            }
         }
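The broadened sed extraction can be exercised locally against a fabricated `slurm_output.log`; the file name and job IDs below are illustrative:

```shell
# Sample output covering sbatch --parsable, srun queued/allocated, and sbatch.
cat > slurm_output.log <<'EOF'
24680
srun: job 12345 queued and waiting for resources
srun: job 12345 has been allocated resources
Submitted batch job 67890
EOF
# Same patterns as the suggested diff; tail -n1 keeps the most recent ID.
job_id=$(sed -n \
  -e 's/^\([0-9][0-9]*\)$/\1/p' \
  -e 's/.*Submitted batch job \([0-9][0-9]*\).*/\1/p' \
  -e 's/.*srun: job \([0-9][0-9]*\) queued.*/\1/p' \
  -e 's/.*srun: job \([0-9][0-9]*\) has been allocated.*/\1/p' \
  slurm_output.log | tail -n1)
echo "$job_id"
```

Putting the numeric-only rule first also avoids double-printing a line that a later substitution has already reduced to a bare number.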

423-425: Avoid exposing passwords on the process command line; use sshpass -e (env var).

-                Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)
+                Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)
-                Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)
+                Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)

Also applies to: 480-482


2258-2264: Retry count mismatch with comment; llmRetry(1) won’t “retry 2 times.”

Align code with intent.

                             // Retry 2 times if timeout occurs.
                             sh "env | sort"
-                            trtllm_utils.llmRetry(1, "checkPipInstall", {
+                            trtllm_utils.llmRetry(2, "checkPipInstall", {
                                 timeout(time: 30, unit: 'MINUTES') {
                                     checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
                                 }
                             })

If llmRetry’s parameter denotes “attempts” not “retries,” adjust accordingly. To verify:

#!/bin/bash
# Locate llmRetry definition and usage semantics
rg -nP 'def\s+llmRetry\s*\(|llmRetry\s*\(' --type groovy -C3
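Since llmRetry's internals aren't shown here, a generic sketch illustrates why "retries" vs. "attempts" changes the total run count; `retry_cmd` and `always_fail` are made-up names:

```shell
# Generic retry sketch: the first argument counts re-attempts AFTER the
# initial try, so `retry_cmd 2 ...` runs the command up to 3 times total.
retry_cmd() {
    retries=$1; shift
    n=0
    until "$@"; do
        n=$((n + 1))
        if [ "$n" -gt "$retries" ]; then return 1; fi
        echo "retry $n of $retries"
    done
}

runs=0
always_fail() { runs=$((runs + 1)); return 1; }
retry_cmd 2 always_fail || echo "gave up after $runs total runs"
```

Under these semantics a first argument of 1 gives only 2 total runs, which is the mismatch the comment points out.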

476-477: Preserve failure from srun through tee; quote file path.

-                    ${srunCmd} 2>&1 | tee ${slurmOutputFile}
+                    set -o pipefail
+                    ${srunCmd} 2>&1 | tee "${slurmOutputFile}"
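A small bash check makes the failure-masking concrete; `fake_srun` stands in for the real srun command:

```shell
# Without pipefail, $? after `cmd | tee` is tee's status, not cmd's.
status_without=$(bash -c '
  fake_srun() { echo "srun: error"; return 3; }
  fake_srun 2>&1 | tee out.log >/dev/null
  echo $?')
# With pipefail, the pipeline reports the rightmost nonzero status.
status_with=$(bash -c '
  set -o pipefail
  fake_srun() { echo "srun: error"; return 3; }
  fake_srun 2>&1 | tee out.log >/dev/null
  echo $?')
echo "without pipefail: $status_without"
echo "with pipefail:    $status_with"
```

Without pipefail the failure is reported as 0 (tee succeeded); with it, the srun exit code 3 survives the pipe.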
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b3f6f64 and d10cb8b.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (17 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@chzblych chzblych disabled auto-merge August 28, 2025 18:24
@tensorrt-cicd

PR_Github #16896 [ skip ] completed with state SUCCESS
Skipping testing for commit d10cb8b

@chzblych chzblych force-pushed the yanchaol-slurm-output branch from d10cb8b to d6a1e13 Compare August 28, 2025 18:45
Signed-off-by: Yanchao Lu <[email protected]>
@chzblych chzblych force-pushed the yanchaol-slurm-output branch from d6a1e13 to 814bf35 Compare August 28, 2025 18:51

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (14)
jenkins/L0_Test.groovy (14)

191-223: Guard single-node SLURM cleanup on missing/invalid slurmJobID.

Currently runs sacct/scontrol/scancel unconditionally.

-        Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID}")
-
-        Utils.exec(
-            pipeline,
-            script: Utils.sshUserCmd(
-                remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-            )
-        )
+        slurmJobID = slurmJobID?.trim()
+        Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID ?: 'N/A'}")
+        if (slurmJobID ==~ /^\d+$/) {
+            Utils.exec(
+                pipeline,
+                script: Utils.sshUserCmd(
+                    remote,
+                    "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                )
+            )
+        } else {
+            Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no valid job ID")
+        }

261-266: Sanitize BUILD_TAG-derived suffix for safe node names/workspaces.

Avoid illegal chars (e.g., “/”, spaces) and length issues.

-    String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
+    def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+    String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+    if (customSuffix.length() > 63) { customSuffix = customSuffix.substring(0, 63) }

302-315: Also accept sbatch --parsable (numeric-only) output in single-node parsing.

Adds robustness when sbatch is invoked with --parsable.

                 def jobIDs = slurmSubmitOutput
                     .readLines()
                     .collect { it.trim() }
                     .collectMany { line ->
                         def ids = []
+                        if (line ==~ /^\d+$/) ids << line
                         def m1 = (line =~ /Submitted batch job (\d+)/)
                         if (m1) ids << m1[0][1]  // Extract the first captured group
                         def m2 = (line =~ /srun: job (\d+) (queued|has been allocated)/)
                         if (m2) ids << m2[0][1]  // Extract the first captured group
                         return ids
                     }

47-49: Make timeouts scope-specific; avoid one env overriding all pods.

Let each pod type honor its own override, with a global fallback.

-POD_TIMEOUT_SECONDS_TEST = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
-POD_TIMEOUT_SECONDS_BUILD = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
-POD_TIMEOUT_SECONDS_SLURM = env.podTimeoutSeconds ? env.podTimeoutSeconds : "79200"  // Use 22 hours to allow for 2 hour of buffer.
+POD_TIMEOUT_SECONDS_TEST  = (env.podTimeoutSecondsTest  ?: env.podTimeoutSeconds ?: "21600")
+POD_TIMEOUT_SECONDS_BUILD = (env.podTimeoutSecondsBuild ?: env.podTimeoutSeconds ?: "43200")
+POD_TIMEOUT_SECONDS_SLURM = (env.podTimeoutSecondsSlurm ?: env.podTimeoutSeconds ?: "79200")  // 22h with 2h buffer

149-160: Broaden Slurm job-id extraction; handle --parsable and file-missing.

Support plain numeric output and guard when the log file is empty/missing.

-                    "\"sed -n " +
-                    "-e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
-                    "-e 's/.*srun: job \\([0-9]\\+\\) queued.*/\\1/p' " +
-                    "-e 's/.*srun: job \\([0-9]\\+\\) has been allocated.*/\\1/p' " +
-                    "${slurmOutputFile} | tail -n1\""
+                    "\"if [ -s '${slurmOutputFile}' ]; then " +
+                    "  sed -n " +
+                    "    -e 's/^\\([0-9]\\+\\)$/\\1/p' " +                                      // sbatch --parsable
+                    "    -e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
+                    "    -e 's/.*srun: job \\([0-9]\\+\\) .*/\\1/p' " +                         // queued/allocated
+                    "    '${slurmOutputFile}' | tail -n1; " +
+                    "fi\""

Also applies to: 153-157


162-165: Avoid Groovy String.isNumber(); use a strict numeric check.

Prevents false negatives across Groovy versions.

-        if (!slurmJobID || !slurmJobID.isNumber()) {
+        slurmJobID = slurmJobID?.trim()
+        if (!(slurmJobID ==~ /^\d+$/)) {
             Utils.exec(pipeline, script: Utils.sshUserCmd(remote, "\"cat ${slurmOutputFile}\""))
             error("Slurm job did not submit successfully. No job ID found.")
         }
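The same integer-only rule can be sketched in shell; note that Groovy's `String.isNumber()` also accepts decimals such as "1.5", which can never be a Slurm job ID, so a strict digit regex is the safer check:

```shell
# Strict numeric check, mirroring the Groovy `==~ /^\d+$/` suggestion.
is_numeric() { printf '%s' "$1" | grep -qE '^[0-9]+$'; }

for candidate in "12345" "12a4" "1.5" ""; do
    if is_numeric "$candidate"; then
        echo "valid:    '$candidate'"
    else
        echo "rejected: '$candidate'"
    fi
done
```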

288-289: Don’t expose passwords on command line; use sshpass -e or withEnv(SSHPASS=...).

Prevents secret leakage in process lists/logs.

Examples:

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${jenkinsSetupPath} ${remote.user}@${remote.host}:~/bloom/scripts/${nodeName}-slurm_jenkins_agent_setup.sh", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${jenkinsSetupPath} ${remote.user}@${remote.host}:~/bloom/scripts/${nodeName}-slurm_jenkins_agent_setup.sh", numRetries: 3,)

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${testListPath} ${remote.user}@${remote.host}:${testListPathNode}", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${testListPath} ${remote.user}@${remote.host}:${testListPathNode}", numRetries: 3,)

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)

To find any remaining occurrences:

#!/bin/bash
rg -nP "sshpass\s+-p\s+'?\$\{?remote\.passwd\}?" -C2 --type groovy

Also applies to: 431-432, 441-447, 455-456, 498-499
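The difference between the two forms is visible without running sshpass at all: the inline form embeds the secret in the argv string, the env form does not. Everything below (the password, hosts, paths) is hypothetical, and sshpass itself is never executed:

```shell
PASS='s3cret'  # hypothetical password, illustration only
# Inline form: the secret is part of the command line, visible in `ps`/logs.
inline_cmd="sshpass -p '${PASS}' scp -r file user@host:/tmp/"
# Env form: sshpass -e reads SSHPASS from the environment; argv stays clean.
env_cmd="sshpass -e scp -r file user@host:/tmp/"
case "$inline_cmd" in *"$PASS"*) inline_leaks=yes ;; *) inline_leaks=no ;; esac
case "$env_cmd"    in *"$PASS"*) env_leaks=yes    ;; *) env_leaks=no    ;; esac
echo "inline argv contains secret: $inline_leaks"
echo "env-var argv contains secret: $env_leaks"
```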


316-319: Same: replace isNumber() with a strict numeric match.

-                if (!slurmJobID || !slurmJobID.isNumber()) {
+                slurmJobID = slurmJobID?.trim()
+                if (!(slurmJobID ==~ /^\d+$/)) {
                     error("Slurm job did not submit successfully. No job ID found.\nSubmission output:\n${slurmSubmitOutput}")
                 }

365-370: Use Jenkins sleep and skip cleanup when no job ID.

Aligns with best practices and avoids no-op scancel.

        stage('Clean up SLURM Resources') {
-            Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
+            sleep(time: 30, unit: 'SECONDS')
             CloudManager.destroyNode(nodeName)
-            Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
-            cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            sleep(time: 30, unit: 'SECONDS')
+            if (slurmJobID?.trim()) {
+                cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            } else {
+                echo "Skip SLURM cleanup: no job ID"
+            }
         }

389-396: Sanitize jobUID/customSuffix derived from BUILD_TAG (multi-node too).

Mirror the single-node sanitizer to avoid illegal chars and overlong names.

-    // Create a unique suffix for the job name
-    String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
-    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
+    // Create a unique, sanitized suffix for the job name
+    def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+    String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+    if (customSuffix.length() > 63) { customSuffix = customSuffix.substring(0, 63) }
+    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"

494-495: Fix pipe exit handling and quote paths in tee.

Ensure srun failures propagate through the pipe.

-                    ${srunCmd} 2>&1 | tee ${slurmOutputFile}
+                    set -o pipefail
+                    ${srunCmd} 2>&1 | tee "${slurmOutputFile}"

729-729: Quote sleep argument in K8s pod command arrays.

Kubernetes expects array of strings; unquoted interpolation may render type mismatches.

-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_SLURM}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_SLURM}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_BUILD}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_BUILD}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_TEST}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_TEST}']

Also applies to: 748-748, 813-815


2253-2259: Comment/code mismatch: llmRetry(1) won’t “retry 2 times”.

Align intent or update the comment.

                             // Retry 2 times if timeout occurs.
 ...
-                            trtllm_utils.llmRetry(1, "checkPipInstall", {
+                            trtllm_utils.llmRetry(2, "checkPipInstall", {

If llmRetry’s parameter is “attempts” not “retries”, set to 3 accordingly.


169-177: Use Jenkins sleep and guard cleanup when no job ID.

Reduces flakiness and avoids running Slurm commands with empty IDs.

-        Utils.exec(pipeline, script: "echo Sleeping to allow slurm job termination; sleep 30")
+        sleep(time: 30, unit: 'SECONDS')
-        Utils.exec(
-            pipeline,
-            script: Utils.sshUserCmd(
-                remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-            )
-        )
+        if (slurmJobID?.trim() ==~ /^\d+$/) {
+            Utils.exec(
+                pipeline,
+                script: Utils.sshUserCmd(
+                    remote,
+                    "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                )
+            )
+        } else {
+            Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no valid job ID")
+        }
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

518-520: Guard multi-node cleanup if slurmOutputFile is missing.

Avoid remote sed on a null/empty path.

        stage('Clean up SLURM Resources') {
-            cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmOutputFile)
+            if (slurmOutputFile) {
+            if (slurmOutputFile) {
+                cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmOutputFile)
+            } else {
+                echo "Skip multi-node SLURM cleanup: no slurmOutputFile"
+            }
         }
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between d10cb8b and d6a1e13.

📒 Files selected for processing (2)
  • jenkins/Build.groovy (2 hunks)
  • jenkins/L0_Test.groovy (17 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • jenkins/Build.groovy
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)

649-657: LGTM: added workspace cleanup and clear end-of-stage logs.

Improves hygiene and traceability.

Also applies to: 679-687

@chzblych
Copy link
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (13)
jenkins/L0_Test.groovy (13)

169-170: Use Jenkins sleep step, not shell sleep.

Cleaner and shows up in Blue Ocean.

-        Utils.exec(pipeline, script: "echo Sleeping to allow slurm job termination; sleep 30")
+        sleep(time: 30, unit: 'SECONDS')

191-221: Same: guard single-node cleanup if job ID missing.

Avoid running Slurm commands with an empty ID.

-        Utils.exec(
-            pipeline,
-            script: Utils.sshUserCmd(
-                remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-            )
-        )
+        if (slurmJobID) {
+            Utils.exec(
+                pipeline,
+                script: Utils.sshUserCmd(
+                    remote,
+                    "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                )
+            )
+        } else {
+            Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no job ID")
+        }

47-49: Use scope-specific pod timeouts; avoid a single env overriding all.

Prefer per-scope overrides with fallback to the global to prevent accidental cross-scope overrides.

-POD_TIMEOUT_SECONDS_TEST = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
-POD_TIMEOUT_SECONDS_BUILD = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
-POD_TIMEOUT_SECONDS_SLURM = env.podTimeoutSeconds ? env.podTimeoutSeconds : "79200"  // Use 22 hours to allow for 2 hour of buffer.
+POD_TIMEOUT_SECONDS_TEST  = (env.podTimeoutSecondsTest  ?: env.podTimeoutSeconds ?: "21600")
+POD_TIMEOUT_SECONDS_BUILD = (env.podTimeoutSecondsBuild ?: env.podTimeoutSeconds ?: "43200")
+POD_TIMEOUT_SECONDS_SLURM = (env.podTimeoutSecondsSlurm ?: env.podTimeoutSeconds ?: "79200")  // 22h with 2h buffer

149-160: Broaden Slurm job ID extraction (support --parsable and other variants).

Handle plain numeric output and “queued and waiting for resources” in addition to existing patterns.

        def slurmJobID = Utils.exec(
             pipeline,
             script: Utils.sshUserCmd(
                 remote,
-                "\"sed -n " +
-                "-e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
-                "-e 's/.*srun: job \\([0-9]\\+\\) queued.*/\\1/p' " +
-                "-e 's/.*srun: job \\([0-9]\\+\\) has been allocated.*/\\1/p' " +
-                "${slurmOutputFile} | tail -n1\""
+                "\"sed -n " +
+                "-e 's/^\\([0-9][0-9]*\\)$/\\1/p' " +                                                  /* --parsable */
+                "-e 's/.*Submitted batch job \\([0-9][0-9]*\\).*/\\1/p' " +
+                "-e 's/.*srun: job \\([0-9][0-9]*\\) queued.*/\\1/p' " +
+                "-e 's/.*srun: job \\([0-9][0-9]*\\) has been allocated.*/\\1/p' " +
+                "-e 's/.*job \\([0-9][0-9]*\\) queued and waiting for resources.*/\\1/p' " +
+                "${slurmOutputFile} | tail -n1\""
             ),
             returnStdout: true
         ).trim()

162-165: Avoid Groovy String.isNumber(); use a numeric regex.

This is more robust across Groovy versions and avoids NPEs.

-        if (!slurmJobID || !slurmJobID.isNumber()) {
+        slurmJobID = slurmJobID?.trim()
+        if (!(slurmJobID ==~ /^\d+$/)) {
             Utils.exec(pipeline, script: Utils.sshUserCmd(remote, "\"cat ${slurmOutputFile}\""))
             error("Slurm job did not submit successfully. No job ID found.")
         }

171-177: Guard cleanup when no job ID.

Skip sacct/scontrol/scancel if submission failed.

-        Utils.exec(
-            pipeline,
-            script: Utils.sshUserCmd(
-                remote,
-                "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-            )
-        )
+        if (slurmJobID) {
+            Utils.exec(
+                pipeline,
+                script: Utils.sshUserCmd(
+                    remote,
+                    "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+                )
+            )
+        } else {
+            Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no job ID")
+        }

302-315: Accept --parsable output during single-node submit parsing.

Add numeric-only line support.

                def jobIDs = slurmSubmitOutput
                     .readLines()
                     .collect { it.trim() }
                     .collectMany { line ->
                         def ids = []
+                        if (line ==~ /^\d+$/) ids << line
                         def m1 = (line =~ /Submitted batch job (\d+)/)
                         if (m1) ids << m1[0][1]  // Extract the first captured group
                         def m2 = (line =~ /srun: job (\d+) (queued|has been allocated)/)
                         if (m2) ids << m2[0][1]  // Extract the first captured group
                         return ids
                     }

316-319: Same: replace isNumber() with regex check.

-                if (!slurmJobID || !slurmJobID.isNumber()) {
+                slurmJobID = slurmJobID?.trim()
+                if (!(slurmJobID ==~ /^\d+$/)) {
                     error("Slurm job did not submit successfully. No job ID found.\nSubmission output:\n${slurmSubmitOutput}")
                 }

365-370: Use Jenkins sleep and guard cleanup.

Replace shell sleeps and skip cleanup if no job ID.

-        stage('Clean up SLURM Resources') {
-            Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
-            CloudManager.destroyNode(nodeName)
-            Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
-            cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
-        }
+        stage('Clean up SLURM Resources') {
+            sleep(time: 30, unit: 'SECONDS')
+            CloudManager.destroyNode(nodeName)
+            sleep(time: 30, unit: 'SECONDS')
+            if (slurmJobID) {
+                cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+            } else {
+                echo "Skip SLURM cleanup: no job ID"
+            }
+        }

389-392: Sanitize BUILD_TAG-derived job UID (avoid illegal chars and length issues).

Prevents slashes/spaces from breaking paths or identifiers.

-    // Create a unique suffix for the job name
-    String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
-    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
+    // Create a unique, sanitized suffix for the job name (lowercase, [a-z0-9-], max 63)
+    def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+    String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+    if (customSuffix.length() > 63) { customSuffix = customSuffix.substring(0, 63) }
+    def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"

494-494: Add pipefail and quote the output path in the tee pipeline.

Ensures srun failures propagate through tee and paths with spaces are safe.

-                    ${srunCmd} 2>&1 | tee ${slurmOutputFile}
+                    set -o pipefail
+                    ${srunCmd} 2>&1 | tee "${slurmOutputFile}"

729-729: Quote sleep args in K8s container commands (array of strings).

K8s expects strings; unquoted numbers can misrender.

-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_SLURM}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_SLURM}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_BUILD}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_BUILD}']
-                    command: ['sleep', ${POD_TIMEOUT_SECONDS_TEST}]
+                    command: ['sleep', '${POD_TIMEOUT_SECONDS_TEST}']

Also applies to: 747-747, 813-813


2253-2259: Align retry count with comment; llmRetry(1) likely doesn’t retry twice.

Make behavior match “Retry 2 times”.

                             // Retry 2 times if timeout occurs.
...
-                            trtllm_utils.llmRetry(1, "checkPipInstall", {
+                            trtllm_utils.llmRetry(2, "checkPipInstall", {
                                 timeout(time: 30, unit: 'MINUTES') {
                                     checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
                                 }
                             })
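`llmRetry` lives in the shared `trtllm_utils` library, so its exact semantics aren't visible here; a minimal sketch of the assumed contract (first argument = number of retries after the initial attempt, hence `llmRetry(2)` for "retry 2 times") looks like:

```shell
# Generic retry helper: run a command up to (1 + retries) times.
retry() {
  retries=$1; shift
  attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -gt "$retries" ]; then
      return 1
    fi
    echo "attempt $attempt failed; retrying" >&2
  done
  return 0
}

# Stand-in command that fails twice, then succeeds on the third call.
calls=0
flaky() {
  calls=$((calls + 1))
  [ "$calls" -ge 3 ]
}
retry 2 flaky
result=$?
```

Under this reading, `retry 2` tolerates exactly two failures, which is what the "Retry 2 times" comment promises and what `llmRetry(1)` would not deliver.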
🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)

324-331: Avoid magic numbers in online-wait; derive attempts or make configurable.

Compute max checks from an env var or derive from duration/sleep interval.

-            while (!CloudManager.isNodeOnline(nodeName) && counter < 90) {
-                // Wait 10 minutes to check status of the node again
-                sleep(time: 10, unit: 'MINUTES')
+            def checkIntervalMin = 10
+            def maxWaitMin = (env.SLURM_ONLINE_WAIT_MINUTES ?: "900") as Integer  // default 15h
+            def maxChecks = (int)(maxWaitMin / checkIntervalMin)
+            while (!CloudManager.isNodeOnline(nodeName) && counter < maxChecks) {
+                sleep(time: checkIntervalMin, unit: 'MINUTES')
                 counter++
             }
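Deriving the attempt count instead of hard-coding 90 is easy to check in isolation (`SLURM_ONLINE_WAIT_MINUTES` is the proposed override, hypothetical here):

```shell
unset SLURM_ONLINE_WAIT_MINUTES  # ensure the 15h default applies in this sketch
check_interval_min=10
max_wait_min="${SLURM_ONLINE_WAIT_MINUTES:-900}"
max_checks=$((max_wait_min / check_interval_min))
echo "max_checks=$max_checks"
```

With the default budget of 900 minutes and a 10-minute poll interval, this reproduces the current 90 checks while making both knobs explicit.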

649-657: Prefer deleteDir() over shell rm -rf for workspace cleanup.

deleteDir() is portable and reliable in Jenkins workspaces.

-                // Clean up the workspace
-                sh """
-                    env | sort
-                    pwd && ls -alh
-                    rm -rf ./*
-                """
+                // Clean up the workspace
+                sh """
+                    env | sort
+                    pwd && ls -alh
+                """
+                deleteDir()
-        // Clean up the workspace
-        sh """
-            env | sort
-            pwd && ls -alh
-            rm -rf ./*
-        """
+        // Clean up the workspace
+        sh """
+            env | sort
+            pwd && ls -alh
+        """
+        deleteDir()

Also applies to: 679-687

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between d6a1e13 and 814bf35.

📒 Files selected for processing (2)
  • jenkins/Build.groovy (2 hunks)
  • jenkins/L0_Test.groovy (17 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • jenkins/Build.groovy
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)

513-513: LGTM: helpful completion log.


292-301: Verify SlurmConfig.generateCommand signature and Jenkins.instance.rootUrl access.
SlurmConfig comes from the shared library – confirm its generateCommand method accepts (SlurmCluster, SlurmPartition, nodeSecret, nodeName, rootUrl). Also ensure Jenkins.instance.rootUrl is permitted in the pipeline sandbox (import jenkins.model.Jenkins or switch to env.JENKINS_URL if needed).

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16900 [ run ] triggered by Bot

@chzblych
Copy link
Collaborator Author

/bot skip --comment "Tested a few cases including kill, abort, pass, etc."

@chzblych chzblych enabled auto-merge (squash) August 28, 2025 20:03
@tensorrt-cicd
Copy link
Collaborator

PR_Github #16904 [ skip ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16900 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16904 [ skip ] completed with state SUCCESS
Skipping testing for commit 814bf35

@chzblych chzblych merged commit 460a34c into NVIDIA:main Aug 28, 2025
7 checks passed
@chzblych chzblych deleted the yanchaol-slurm-output branch August 28, 2025 20:19