[None][chore] Some improvements for CI stability #7199
📝 Walkthrough

Refactors pod timeout constants into test/build/slurm scopes; introduces a per-run Slurm jobUID, per-run scripts and logs, and server-side extraction, validation, and propagation of Slurm job IDs; extends the online-wait window; updates cleanup signatures to accept Slurm IDs and run Slurm-specific cleanup; improves logging, workspace hygiene, and pip-install retry.
Sequence Diagram(s)

sequenceDiagram
autonumber
participant J as Jenkins pipeline
participant N as Remote node
participant S as Slurm scheduler
participant K as Kubernetes
rect rgb(250,250,255)
note right of J: prepare per-run jobUID\ncreate per-run workspace & scripts\nlog env/pwd/ls
end
J->>N: upload per-run scripts (slurm_run.sh, slurm_launch.sh)
J->>N: cat uploaded scripts (debug visibility)
J->>N: execute submission script -> tee per-run slurm_output.log
N->>S: sbatch / srun submission
S-->>N: submission output (contains job ID)
N-->>J: slurm_output.log (captured)
J->>J: parse slurm_output.log -> extract & validate slurmJobID
alt Node online within window
J->>K: run tests (containers use POD_TIMEOUT_SECONDS_TEST / BUILD / SLURM)
else Timeout / failure
J->>J: fail pipeline with explicit error and trigger cleanup
end
rect rgb(245,255,245)
note right of J: Finalization — invoke cleanup with slurmJobID\nrun scancel/sacct/scontrol and remove node resources\npurge workspace & logs
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast
Actionable comments posted: 1
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
116-118: Confirm Slurm results path compatibility with the new shared-lib

This scp targets /home/svc_tensorrt/bloom/scripts/${nodeName}/results/results.xml. If the dev branch changed where the agent writes results, this silently degrades to "No results xml to submit" and we lose JUnit data. Recommend confirming the path contract; optionally allow an env override for fast iteration.

Minimal, backward-compatible tweak to enable an override:

- def resultsFilePath = "/home/svc_tensorrt/bloom/scripts/${nodeName}/results/results.xml"
- def downloadResultCmd = "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host}:${resultsFilePath} ${stageName}/"
+ def defaultResultsFile = "/home/svc_tensorrt/bloom/scripts/${nodeName}/results/results.xml"
+ def resultsFilePath = env.SLURM_RESULTS_FILE ?: defaultResultsFile
+ def downloadResultCmd = "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host}:${resultsFilePath} ${stageName}/"

Quick verification during a trial run: check both old/new locations before scp to reduce false negatives.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)
242-247: Verify library resource availability: slurm_jenkins_agent_setup.sh

This relies on Utils.copyLibraryResource to fetch slurm_jenkins_agent_setup.sh from the shared lib. Ensure the dev branch keeps the same resource path/name; otherwise this step fails early in "Request Node via SLURM."
Suggested dry-run check during the first job on the new branch:
- Log the resolved library resource path before chmod.
- If failure, print “library resource not found in bloom-jenkins-shared-lib@dev-yanchaol-slurm-output” to speed diagnosis.
PR_Github #16379 [ run ] triggered by Bot
/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast
PR_Github #16383 [ run ] triggered by Bot
PR_Github #16379 [ run ] completed with state
PR_Github #16383 [ run ] completed with state
/bot run --stage-list "GH200-PackageSanityCheck-PY312-DLFW, DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast
Force-pushed from d0980b5 to 9b4e355
/bot run --stage-list "GH200-PackageSanityCheck-PY312-DLFW, DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast
PR_Github #16456 [ run ] triggered by Bot
PR_Github #16457 [ run ] triggered by Bot
PR_Github #16456 [ run ] completed with state
Actionable comments posted: 1
♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)
1-1: Resolved: shared-lib pinned to main (reproducible builds)

You reverted to using bloom-jenkins-shared-lib@main. This addresses the prior concern about pinning to a feature branch and restores reproducibility.
🧹 Nitpick comments (6)
jenkins/L0_Test.groovy (6)
248-249: Gate full script dumps behind a debug flag to avoid leaking internals to the console

Dumping the entire setup script into the build log increases noise and risks exposing internal paths/infra details. Use the existing DETAILED_LOG flag to control this.

Apply this diff:

- Utils.exec(pipeline, script: "cat ${jenkinsSetupPath}")
+ if (testFilter[(DETAILED_LOG)]) {
+     Utils.exec(pipeline, script: "cat ${jenkinsSetupPath}")
+ }
300-304: Replace fixed sleeps with deterministic destroy+wait logic

Fixed 30s sleeps add latency without guarantees. Prefer a destroy plus a bounded wait for the node to go offline, making cleanup predictable and faster when possible.

Apply this diff:

- Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
- CloudManager.destroyNode(nodeName)
- Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
- cleanUpNodeResources(pipeline, cluster, nodeName)
+ timeout(time: 10, unit: 'MINUTES') {
+     // Try destroy, then verify the node goes offline; retry a few times if needed.
+     retry(3) {
+         CloudManager.destroyNode(nodeName)
+         sleep(time: 10, unit: 'SECONDS')
+         if (CloudManager.isNodeOnline(nodeName)) {
+             error "Node ${nodeName} still online after destroy attempt; retrying..."
+         }
+     }
+ }
+ cleanUpNodeResources(pipeline, cluster, nodeName)
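As an aside, the destroy-then-verify pattern suggested here can be exercised outside Jenkins; this is a minimal shell sketch where `destroy_node` and `node_online` are hypothetical stand-ins for the CloudManager calls, not real commands:

```shell
# Destroy a node, then verify it actually went offline; retry a bounded
# number of times instead of sleeping blindly.
destroy_with_retry() {
  node="$1"; attempts="${2:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    destroy_node "$node"                 # hypothetical destroy call
    sleep "${RETRY_DELAY:-10}"
    if ! node_online "$node"; then       # hypothetical liveness probe
      echo "node $node offline after attempt $i"
      return 0
    fi
    i=$((i + 1))
  done
  echo "node $node still online after $attempts attempts" >&2
  return 1
}
```

The loop returns as soon as the node is confirmed offline, which is the property the fixed 30-second sleeps cannot guarantee.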
326-326: Minor: gate noisy env/pwd/ls dump behind DETAILED_LOG

Great for debugging; consider guarding with DETAILED_LOG to keep default logs tight.

Example change:

- Utils.exec(pipeline, script: "env | sort && pwd && ls -alh")
+ if (testFilter[(DETAILED_LOG)]) {
+     Utils.exec(pipeline, script: "env | sort && pwd && ls -alh")
+ }
370-370: Gate slurm_run.sh content dump behind DETAILED_LOG

Same rationale as above; reduces log noise and potential info exposure.

Apply this diff:

- Utils.exec(pipeline, script: "cat ${scriptRunLocalPath}")
+ if (testFilter[(DETAILED_LOG)]) {
+     Utils.exec(pipeline, script: "cat ${scriptRunLocalPath}")
+ }
428-428: Gate slurm_launch.sh content dump behind DETAILED_LOG

The generated script includes environment and paths; dump it only when requested.

Apply this diff:

- Utils.exec(pipeline, script: "cat ${scriptLaunchDestPath}")
+ if (testFilter[(DETAILED_LOG)]) {
+     Utils.exec(pipeline, script: "cat ${scriptLaunchDestPath}")
+ }
322-325: Harden jobUID/suffix for cross-system safety

BUILD_TAG may include characters that aren't valid in Slurm job names, SSH identifiers, or filesystem paths. To prevent downstream failures, sanitize the suffix to allow only lowercase letters, digits, and hyphens (and underscores if desired), then cap the entire jobUID at 63 characters (the common max for many schedulers and filesystems).

Please update lines 322-325 in jenkins/L0_Test.groovy as follows:

- // Create a unique suffix for the job name
- String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
- def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
+ // Create a unique, sanitized suffix for the job name
+ def suffixRaw = ("${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}").toLowerCase()
+ String customSuffix = suffixRaw.replaceAll('[^a-z0-9-]', '')
+ // Build jobUID and cap length for Slurm/SSH/filesystem compatibility
+ def jobUID = ("${cluster.host}-multi_node_test-${customSuffix}")
+     .replaceAll('[^a-z0-9-_]', '')
+     .take(63)

Optionally, add a runtime check to flag any violations:

#!/bin/bash
# Ensure jobUID only contains [-a-z0-9_] and is ≤63 chars
rg -n 'multi_node_test' --type=groovy -C2
📒 Files selected for processing (1)
jenkins/L0_Test.groovy (6 hunks)
Force-pushed from e57f754 to f89dfd7
/bot run --stage-list "GH200-PackageSanityCheck-PY312-DLFW, DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast
PR_Github #16458 [ run ] triggered by Bot
PR_Github #16457 [ run ] completed with state
PR_Github #16458 [ run ] completed with state
/bot run --stage-list "GH200-PackageSanityCheck-PY312-DLFW" --disable-fail-fast
PR_Github #16503 [ run ] triggered by Bot
PR_Github #16867 [ run ] triggered by Bot
PR_Github #16867 [ run ] completed with state
Force-pushed from 0ad26bf to 7ec59b1
Actionable comments posted: 0
♻️ Duplicate comments (13)
jenkins/L0_Test.groovy (13)
47-49: Make pod timeouts scope-specific; don't let one env var override all three

Use per-scope overrides with fallback to global, as previously suggested.

-POD_TIMEOUT_SECONDS_TEST = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
-POD_TIMEOUT_SECONDS_BUILD = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
-POD_TIMEOUT_SECONDS_SLURM = env.podTimeoutSeconds ? env.podTimeoutSeconds : "79200" // Use 22 hours to allow for 2 hour of buffer.
+POD_TIMEOUT_SECONDS_TEST = (env.podTimeoutSecondsTest ?: env.podTimeoutSeconds ?: "21600")
+POD_TIMEOUT_SECONDS_BUILD = (env.podTimeoutSecondsBuild ?: env.podTimeoutSeconds ?: "43200")
+POD_TIMEOUT_SECONDS_SLURM = (env.podTimeoutSecondsSlurm ?: env.podTimeoutSeconds ?: "79200") // 22h with 2h buffer
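For what it's worth, the Groovy `?:` fallback chain maps directly onto shell default expansion, which makes the precedence easy to verify in isolation; `resolve_timeout` is an illustrative helper, not part of the pipeline:

```shell
# Resolve a timeout: scope-specific override wins, then the global
# override, then the hard-coded default — same precedence as
# `env.podTimeoutSecondsTest ?: env.podTimeoutSeconds ?: "21600"`.
resolve_timeout() {
  scope_override="$1"; global_override="$2"; default="$3"
  echo "${scope_override:-${global_override:-$default}}"
}
```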
137-169: Guard SLURM cleanup when no job ID; avoid unconditional scancel/sacct

Skip cleanup if submission failed. This was flagged earlier and still applies.

- Utils.exec(
-     pipeline,
-     timeout: false,
-     script: Utils.sshUserCmd(
-         remote,
-         "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-     )
- )
+ if (slurmJobID?.trim()) {
+     Utils.exec(
+         pipeline,
+         timeout: false,
+         script: Utils.sshUserCmd(
+             remote,
+             "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+         )
+     )
+ } else {
+     Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no job ID")
+ }
171-205: Same issue in single-node cleanup: guard on missing slurmJobID

- Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID}")
+ Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID ?: 'N/A'}")
@@
- Utils.exec(
-     pipeline,
-     timeout: false,
-     script: Utils.sshUserCmd(
-         remote,
-         "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-     )
- )
+ if (slurmJobID?.trim()) {
+     Utils.exec(
+         pipeline,
+         timeout: false,
+         script: Utils.sshUserCmd(
+             remote,
+             "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+         )
+     )
+ } else {
+     Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no job ID")
+ }
284-297: Broaden job-id parsing to include --parsable numeric output

  def jobIDs = slurmSubmitOutput
      .readLines()
      .collect { it.trim() }
      .collectMany { line ->
          def ids = []
+         if (line ==~ /^\d+$/) ids << line
          def m1 = (line =~ /Submitted batch job (\d+)/)
          if (m1) ids << m1[0][1] // Extract the first captured group
          def m2 = (line =~ /srun: job (\d+) (queued|has been allocated)/)
          if (m2) ids << m2[0][1] // Extract the first captured group
          return ids
      }
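The same multi-pattern extraction can be spot-checked in isolation with a small sed helper mirroring the patterns above; `extract_slurm_job_id` is an illustrative name, and the samples are mocked submission output:

```shell
# Pull a Slurm job ID out of submission output. Covers:
#   "Submitted batch job N", "srun: job N queued ...",
#   "srun: job N has been allocated ...", and the bare numeric
#   line that `sbatch --parsable` emits.
extract_slurm_job_id() {
  sed -n \
    -e 's/^\([0-9][0-9]*\)$/\1/p' \
    -e 's/.*Submitted batch job \([0-9][0-9]*\).*/\1/p' \
    -e 's/.*srun: job \([0-9][0-9]*\) queued.*/\1/p' \
    -e 's/.*srun: job \([0-9][0-9]*\) has been allocated.*/\1/p' \
    | tail -n1
}
```

Feeding it non-matching text yields an empty string, which is exactly the case the "guard on missing job ID" comments want handled before running scancel/sacct.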
347-353: Cleanup: replace shell sleeps; guard cleanup on job ID

- Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
+ echo "Sleeping to allow docker stop..."
+ sleep(time: 30, unit: 'SECONDS')
  CloudManager.destroyNode(nodeName)
- Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
- cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+ echo "Sleeping to allow node destruction..."
+ sleep(time: 30, unit: 'SECONDS')
+ if (slurmJobID?.trim()) {
+     cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+ } else {
+     echo "Skip SLURM cleanup: no job ID"
+ }
371-377: Sanitize BUILD_TAG-derived suffix for jobUID; enforce [a-z0-9-] and length

Prevents illegal characters in job names/paths. Apply a similar sanitizer to the single-node path (lines 243-247).

- // Create a unique suffix for the job name
- String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
- def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
+ // Create a unique, sanitized suffix for the job name (lowercase, [a-z0-9-], max 63)
+ def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+ String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+ customSuffix = customSuffix.length() > 63 ? customSuffix.substring(0, 63) : customSuffix
+ def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
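The sanitizer's rules are easy to check outside Groovy; this shell equivalent (illustrative helper name, not pipeline code) applies the same transformation: lowercase, map everything outside [a-z0-9-] to a hyphen, collapse hyphen runs, cap at 63 characters:

```shell
# Sanitize a BUILD_TAG-style suffix for use in job names/paths.
sanitize_suffix() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9-]/-/g' -e 's/--*/-/g' \
    | cut -c1-63
}
```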
423-425: Avoid exposing passwords on the process command line (scp upload of run script)

- Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)
+ Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)
476-477: Harden pipe: enable pipefail and quote output path

- ${srunCmd} 2>&1 | tee ${slurmOutputFile}
+ set -o pipefail
+ ${srunCmd} 2>&1 | tee "${slurmOutputFile}"
480-482: Avoid exposing passwords on process command line (scp upload of launch script)

- Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)
+ Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)
501-513: Broaden multi-node job-id extraction; include numeric lines

  def slurmJobID = Utils.exec(
      pipeline,
      timeout: false,
      script: Utils.sshUserCmd(
          remote,
-         "\"sed -n " +
-         "-e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
-         "-e 's/.*srun: job \\([0-9]\\+\\) queued.*/\\1/p' " +
-         "-e 's/.*srun: job \\([0-9]\\+\\) has been allocated.*/\\1/p' " +
-         "${slurmOutputFile} | tail -n1\""
+         "\"sed -n " +
+         "-e 's/^\\([0-9][0-9]*\\)$/\\1/p' " +
+         "-e 's/.*Submitted batch job \\([0-9][0-9]*\\).*/\\1/p' " +
+         "-e 's/.*srun: job \\([0-9][0-9]*\\) queued.*/\\1/p' " +
+         "-e 's/.*srun: job \\([0-9][0-9]*\\) has been allocated.*/\\1/p' " +
+         "${slurmOutputFile} | tail -n1\""
      ),
      returnStdout: true
  ).trim()
520-523: Replace shell sleep; guard cleanup on job ID

- Utils.exec(pipeline, script: "echo Sleeping to allow slurm job termination; sleep 30")
-
- cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmJobID)
+ echo "Sleeping to allow slurm job termination..."
+ sleep(time: 30, unit: 'SECONDS')
+ if (slurmJobID?.trim()) {
+     cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmJobID)
+ } else {
+     echo "Skip SLURM cleanup: no job ID"
+ }
732-732: Quote sleep arg in K8s container command arrays (strings required)

- command: ['sleep', ${POD_TIMEOUT_SECONDS_SLURM}]
+ command: ['sleep', '${POD_TIMEOUT_SECONDS_SLURM}']

- command: ['sleep', ${POD_TIMEOUT_SECONDS_BUILD}]
+ command: ['sleep', '${POD_TIMEOUT_SECONDS_BUILD}']

- command: ['sleep', ${POD_TIMEOUT_SECONDS_TEST}]
+ command: ['sleep', '${POD_TIMEOUT_SECONDS_TEST}']

Also applies to: 750-750, 816-816
2256-2263: Retry mismatch: comment says 2, code retries 1. Align to intent.

  // Retry 2 times if timeout occurs.
@@
- trtllm_utils.llmRetry(1, "checkPipInstall", {
+ trtllm_utils.llmRetry(2, "checkPipInstall", {
🧹 Nitpick comments (3)
jenkins/L0_Test.groovy (3)
272-273: Nit: avoid dumping script content unless needed

If the setup script ever contains secrets/URLs, avoid logging it verbatim.
274-282: Prefer sbatch --parsable at the source

If possible, make SlurmConfig.generateCommand(...) use "sbatch --parsable" and capture the numeric ID directly, avoiding brittle parsing here and downstream.
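For reference, `sbatch --parsable` emits just the job ID (or `jobid;cluster` on federated setups, per the sbatch documentation), so parsing reduces to a field cut; `sbatch` is mocked with `echo` below since no scheduler is available in this sketch:

```shell
# With --parsable, sbatch prints "jobid" or "jobid;cluster";
# extracting the ID is a single cut on the semicolon.
parse_parsable() { cut -d';' -f1; }

# Mocked submission output (a real call would be: sbatch --parsable job.sh)
job_id=$(echo '12345;clusterA' | parse_parsable)
echo "parsed job ID: $job_id"
```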
303-303: Use Jenkins sleep step instead of shell sleep

- Utils.exec(pipeline, script: "echo Sleeping to allow agent initialization; sleep 30")
+ echo "Sleeping to allow agent initialization..."
+ sleep(time: 30, unit: 'SECONDS')
📒 Files selected for processing (2)
jenkins/Build.groovy (2 hunks)
jenkins/L0_Test.groovy (17 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- jenkins/Build.groovy
🔇 Additional comments (9)
jenkins/L0_Test.groovy (9)
249-251: LGTM: initialize slurmJobID.
298-301: LGTM: validate job ID and fail early.
308-313: Verify wait window alignment (15h) with Slurm/K8s timeouts. Ensure the 90×10min loop won't exceed job/pod limits; consider making it env-configurable.
343-344: LGTM: explicit failure on node not online.
490-491: LGTM: run remote launch script via ssh.
495-496: LGTM: clear end-of-stage logging.
515-519: LGTM: failure path shows output then errors.
652-660: LGTM: workspace cleanup on no-result path.
682-690: LGTM: final workspace cleanup and explicit marker.
Force-pushed from 7ec59b1 to b3f6f64
/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast
PR_Github #16889 [ run ] triggered by Bot
PR_Github #16889 [ run ] completed with state
/bot skip --comment "Tested few cases including kill, abort, pass, etc."
PR_Github #16896 [ skip ] triggered by Bot
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
jenkins/L0_Test.groovy (1)
118-121: Same secret exposure in uploadResults; switch to sshpass -e

Inline passwords leak via process list/logs.

- def downloadResultCmd = "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host}:${resultsFilePath} ${stageName}/"
+ def downloadResultCmd = "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host}:${resultsFilePath} ${stageName}/"
♻️ Duplicate comments (11)
jenkins/L0_Test.groovy (11)
734-734: Quote sleep argument in k8s command arrays (must be strings)

Unquoted values can render non-strings and break the pod spec.

- command: ['sleep', ${POD_TIMEOUT_SECONDS_SLURM}]
+ command: ['sleep', '${POD_TIMEOUT_SECONDS_SLURM}']

- command: ['sleep', ${POD_TIMEOUT_SECONDS_BUILD}]
+ command: ['sleep', '${POD_TIMEOUT_SECONDS_BUILD}']

- command: ['sleep', ${POD_TIMEOUT_SECONDS_TEST}]
+ command: ['sleep', '${POD_TIMEOUT_SECONDS_TEST}']

Also applies to: 752-752, 819-819
136-169: Guard SLURM cleanup when job ID is missing; avoid scancel/sacct with empty ID

Skip remote cleanup if submission failed.

- Utils.exec(
+ if (slurmJobID?.trim()) {
+     Utils.exec(
          pipeline,
          timeout: false,
          script: Utils.sshUserCmd(
              remote,
              "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
          )
-     )
+     )
+ }
171-205: Same: guard single-node cleanup when no job ID

- Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID}")
+ Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID ?: 'N/A'}")
- Utils.exec(
+ if (slurmJobID?.trim()) {
+     Utils.exec(
          pipeline,
          timeout: false,
          script: Utils.sshUserCmd(
              remote,
              "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
          )
-     )
+     )
+ }
47-49: Use scope-specific pod timeout envs; avoid a single global override

Prefer per-scope overrides with a global fallback.

-POD_TIMEOUT_SECONDS_TEST = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
-POD_TIMEOUT_SECONDS_BUILD = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
-POD_TIMEOUT_SECONDS_SLURM = env.podTimeoutSeconds ? env.podTimeoutSeconds : "79200" // Use 22 hours to allow for 2 hour of buffer.
+POD_TIMEOUT_SECONDS_TEST = (env.podTimeoutSecondsTest ?: env.podTimeoutSeconds ?: "21600")
+POD_TIMEOUT_SECONDS_BUILD = (env.podTimeoutSecondsBuild ?: env.podTimeoutSeconds ?: "43200")
+POD_TIMEOUT_SECONDS_SLURM = (env.podTimeoutSecondsSlurm ?: env.podTimeoutSeconds ?: "79200") // 22h with 2h buffer
284-297: Broaden Slurm ID parsing; support --parsable (plain numeric) output

  def jobIDs = slurmSubmitOutput
      .readLines()
      .collect { it.trim() }
      .collectMany { line ->
          def ids = []
+         if (line ==~ /^\d+$/) ids << line
          def m1 = (line =~ /Submitted batch job (\d+)/)
          if (m1) ids << m1[0][1] // Extract the first captured group
          def m2 = (line =~ /srun: job (\d+) (queued|has been allocated)/)
          if (m2) ids << m2[0][1] // Extract the first captured group
          return ids
      }
347-352: Use Jenkins sleep step; guard cleanup when no job ID

- Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
+ sleep(time: 30, unit: 'SECONDS')
  CloudManager.destroyNode(nodeName)
- Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
- cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+ sleep(time: 30, unit: 'SECONDS')
+ if (slurmJobID?.trim()) {
+     cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+ } else {
+     echo "Skip SLURM cleanup: no job ID"
+ }
371-376: Sanitize BUILD_TAG-derived suffix (safe chars, max length); apply the same to the single-node path

- // Create a unique suffix for the job name
- String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
+ // Create a unique, sanitized suffix for the job name (lowercase, [a-z0-9-], max 63)
+ def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+ String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+ if (customSuffix.length() > 63) customSuffix = customSuffix.substring(0, 63)
  def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"

Also mirror this sanitizer in the single-node block (lines 243-248).
500-525: Verify the ID before cleanup; broaden extraction; avoid running cleanup first

Suggested end state:

stage('Clean up SLURM Resources') {
    def slurmJobID = Utils.exec(
        pipeline,
        timeout: false,
        script: Utils.sshUserCmd(
            remote,
            "\"sed -n " +
            "-e 's/.*Submitted batch job \\([0-9][0-9]*\\).*/\\1/p' " +
            "-e 's/.*srun: job \\([0-9][0-9]*\\) queued.*/\\1/p' " +
            "-e 's/.*srun: job \\([0-9][0-9]*\\) has been allocated.*/\\1/p' " +
            "-e 's/^\\([0-9][0-9]*\\)\$/\\1/p' " +
            "${slurmOutputFile} | tail -n1\""
        ),
        returnStdout: true
    ).trim()
    if (slurmJobID ==~ /^\d+$/) {
        sleep(time: 30, unit: 'SECONDS')
        cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmJobID)
        echo "Slurm job ID: ${slurmJobID}"
    } else {
        Utils.exec(pipeline, timeout: false, script: Utils.sshUserCmd(remote, "\"cat ${slurmOutputFile}\""))
        echo "Slurm job did not submit successfully. No job ID found."
    }
}
423-425: Avoid exposing passwords on the process command line; use sshpass -e (env var)

- Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)
+ Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)

- Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)
+ Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)

Also applies to: 480-482
2258-2264: Retry count mismatch with comment; llmRetry(1) won't "retry 2 times"

Align the code with the intent.

  // Retry 2 times if timeout occurs.
  sh "env | sort"
- trtllm_utils.llmRetry(1, "checkPipInstall", {
+ trtllm_utils.llmRetry(2, "checkPipInstall", {
      timeout(time: 30, unit: 'MINUTES') {
          checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
      }
  })

If llmRetry's parameter denotes "attempts" rather than "retries", adjust accordingly. To verify:

#!/bin/bash
# Locate llmRetry definition and usage semantics
rg -nP 'def\s+llmRetry\s*\(|llmRetry\s*\(' --type groovy -C3
476-477: Preserve the srun failure through tee; quote the file path

- ${srunCmd} 2>&1 | tee ${slurmOutputFile}
+ set -o pipefail
+ ${srunCmd} 2>&1 | tee "${slurmOutputFile}"
📒 Files selected for processing (1)
jenkins/L0_Test.groovy (17 hunks)
PR_Github #16896 [ skip ] completed with state
Force-pushed from d10cb8b to d6a1e13
Signed-off-by: Yanchao Lu <[email protected]>
Force-pushed from d6a1e13 to 814bf35
Actionable comments posted: 0
♻️ Duplicate comments (14)
jenkins/L0_Test.groovy (14)
191-223: Guard single-node SLURM cleanup on missing/invalid slurmJobID

Currently runs sacct/scontrol/scancel unconditionally.

- Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID}")
-
- Utils.exec(
-     pipeline,
-     script: Utils.sshUserCmd(
-         remote,
-         "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-     )
- )
+ slurmJobID = slurmJobID?.trim()
+ Utils.exec(pipeline, script: "echo Slurm job ID: ${slurmJobID ?: 'N/A'}")
+ if (slurmJobID ==~ /^\d+$/) {
+     Utils.exec(
+         pipeline,
+         script: Utils.sshUserCmd(
+             remote,
+             "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+         )
+     )
+ } else {
+     Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no valid job ID")
+ }
261-266: Sanitize BUILD_TAG-derived suffix for safe node names/workspaces

Avoid illegal characters (e.g. "/", spaces) and length issues.

- String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
+ def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+ String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+ if (customSuffix.length() > 63) { customSuffix = customSuffix.substring(0, 63) }
302-315
: Also accept sbatch --parsable (numeric-only) output in single-node parsing. Adds robustness when sbatch is invoked with --parsable.

```diff
 def jobIDs = slurmSubmitOutput
     .readLines()
     .collect { it.trim() }
     .collectMany { line ->
         def ids = []
+        if (line ==~ /^\d+$/) ids << line
         def m1 = (line =~ /Submitted batch job (\d+)/)
         if (m1) ids << m1[0][1] // Extract the first captured group
         def m2 = (line =~ /srun: job (\d+) (queued|has been allocated)/)
         if (m2) ids << m2[0][1] // Extract the first captured group
         return ids
     }
```
47-49
: Make timeouts scope-specific; avoid one env var overriding all pods. Let each pod type honor its own override, with a global fallback.

```diff
-POD_TIMEOUT_SECONDS_TEST = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
-POD_TIMEOUT_SECONDS_BUILD = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
-POD_TIMEOUT_SECONDS_SLURM = env.podTimeoutSeconds ? env.podTimeoutSeconds : "79200" // Use 22 hours to allow for 2 hour of buffer.
+POD_TIMEOUT_SECONDS_TEST = (env.podTimeoutSecondsTest ?: env.podTimeoutSeconds ?: "21600")
+POD_TIMEOUT_SECONDS_BUILD = (env.podTimeoutSecondsBuild ?: env.podTimeoutSeconds ?: "43200")
+POD_TIMEOUT_SECONDS_SLURM = (env.podTimeoutSecondsSlurm ?: env.podTimeoutSeconds ?: "79200") // 22h with 2h buffer
```
149-160
: Broaden Slurm job-id extraction; handle --parsable and a missing file. Support plain numeric output and guard when the log file is empty or missing.

```diff
-"\"sed -n " +
-"-e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
-"-e 's/.*srun: job \\([0-9]\\+\\) queued.*/\\1/p' " +
-"-e 's/.*srun: job \\([0-9]\\+\\) has been allocated.*/\\1/p' " +
-"${slurmOutputFile} | tail -n1\""
+"\"if [ -s '${slurmOutputFile}' ]; then " +
+"  sed -n " +
+"    -e 's/^\\([0-9]\\+\\)$/\\1/p' " +                            // sbatch --parsable
+"    -e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
+"    -e 's/.*srun: job \\([0-9]\\+\\) .*/\\1/p' " +               // queued/allocated
+"    '${slurmOutputFile}' | tail -n1; " +
+"fi\""
```

Also applies to: 153-157
162-165
: Avoid Groovy String.isNumber(); use a strict numeric check. Prevents false negatives across Groovy versions.

```diff
-if (!slurmJobID || !slurmJobID.isNumber()) {
+slurmJobID = slurmJobID?.trim()
+if (!(slurmJobID ==~ /^\d+$/)) {
     Utils.exec(pipeline, script: Utils.sshUserCmd(remote, "\"cat ${slurmOutputFile}\""))
     error("Slurm job did not submit successfully. No job ID found.")
 }
```
288-289
: Don't expose passwords on the command line; use sshpass -e or withEnv(SSHPASS=...). Prevents secret leakage in process lists and logs.
Examples:
```diff
-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${jenkinsSetupPath} ${remote.user}@${remote.host}:~/bloom/scripts/${nodeName}-slurm_jenkins_agent_setup.sh", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${jenkinsSetupPath} ${remote.user}@${remote.host}:~/bloom/scripts/${nodeName}-slurm_jenkins_agent_setup.sh", numRetries: 3,)

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptRunLocalPath} ${remote.user}@${remote.host}:${scriptRunNode}", numRetries: 3,)

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${testListPath} ${remote.user}@${remote.host}:${testListPathNode}", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${testListPath} ${remote.user}@${remote.host}:${testListPathNode}", numRetries: 3,)

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e scp -r -p ${COMMON_SSH_OPTIONS} ${scriptLaunchDestPath} ${remote.user}@${remote.host}:${scriptLaunch}", numRetries: 3,)
```

To find any remaining occurrences:
```bash
#!/bin/bash
rg -nP "sshpass\s+-p\s+'?\$\{?remote\.passwd\}?" -C2 --type groovy
```

Also applies to: 431-432, 441-447, 455-456, 498-499
316-319
: Same: replace isNumber() with a strict numeric match.

```diff
-if (!slurmJobID || !slurmJobID.isNumber()) {
+slurmJobID = slurmJobID?.trim()
+if (!(slurmJobID ==~ /^\d+$/)) {
     error("Slurm job did not submit successfully. No job ID found.\nSubmission output:\n${slurmSubmitOutput}")
 }
```
365-370
: Use the Jenkins sleep step and skip cleanup when there is no job ID. Aligns with best practices and avoids a no-op scancel.

```diff
 stage('Clean up SLURM Resources') {
-    Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
+    sleep(time: 30, unit: 'SECONDS')
     CloudManager.destroyNode(nodeName)
-    Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
-    cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+    sleep(time: 30, unit: 'SECONDS')
+    if (slurmJobID?.trim()) {
+        cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+    } else {
+        echo "Skip SLURM cleanup: no job ID"
+    }
 }
```
389-396
: Sanitize jobUID/customSuffix derived from BUILD_TAG (multi-node too). Mirror the single-node sanitizer to avoid illegal characters and overlong names.

```diff
-// Create a unique suffix for the job name
-String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
-def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
+// Create a unique, sanitized suffix for the job name
+def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+if (customSuffix.length() > 63) { customSuffix = customSuffix.substring(0, 63) }
+def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
```
494-495
: Fix pipe exit handling and quote the path passed to tee. Ensure srun failures propagate through the pipe.

```diff
-${srunCmd} 2>&1 | tee ${slurmOutputFile}
+set -o pipefail
+${srunCmd} 2>&1 | tee "${slurmOutputFile}"
```
729-729
: Quote the sleep argument in K8s pod command arrays. Kubernetes expects an array of strings; unquoted interpolation may render type mismatches.

```diff
-command: ['sleep', ${POD_TIMEOUT_SECONDS_SLURM}]
+command: ['sleep', '${POD_TIMEOUT_SECONDS_SLURM}']

-command: ['sleep', ${POD_TIMEOUT_SECONDS_BUILD}]
+command: ['sleep', '${POD_TIMEOUT_SECONDS_BUILD}']

-command: ['sleep', ${POD_TIMEOUT_SECONDS_TEST}]
+command: ['sleep', '${POD_TIMEOUT_SECONDS_TEST}']
```

Also applies to: 748-748, 813-815
2253-2259
: Comment/code mismatch: llmRetry(1) won't "retry 2 times". Align the intent or update the comment.

```diff
 // Retry 2 times if timeout occurs.
 ...
-trtllm_utils.llmRetry(1, "checkPipInstall", {
+trtllm_utils.llmRetry(2, "checkPipInstall", {
```

If llmRetry's parameter is "attempts" rather than "retries", set it to 3 accordingly.
169-177
: Use the Jenkins sleep step and guard cleanup when there is no job ID. Reduces flakiness and avoids running Slurm commands with empty IDs.

```diff
-Utils.exec(pipeline, script: "echo Sleeping to allow slurm job termination; sleep 30")
+sleep(time: 30, unit: 'SECONDS')

-Utils.exec(
-    pipeline,
-    script: Utils.sshUserCmd(
-        remote,
-        "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-    )
-)
+if (slurmJobID?.trim() ==~ /^\d+$/) {
+    Utils.exec(
+        pipeline,
+        script: Utils.sshUserCmd(
+            remote,
+            "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+        )
+    )
+} else {
+    Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no valid job ID")
+}
```
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
518-520
: Guard multi-node cleanup if slurmOutputFile is missing. Avoid running a remote sed on a null/empty path.

```diff
 stage('Clean up SLURM Resources') {
-    cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmOutputFile)
+    if (slurmOutputFile) {
+        cleanUpNodeResourcesMultiNodes(pipeline, cluster, jobUID, slurmOutputFile)
+    } else {
+        echo "Skip multi-node SLURM cleanup: no slurmOutputFile"
+    }
 }
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- jenkins/Build.groovy (2 hunks)
- jenkins/L0_Test.groovy (17 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- jenkins/Build.groovy
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)
649-657
: LGTM: added workspace cleanup and clear end-of-stage logs. Improves hygiene and traceability.
Also applies to: 679-687
/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" --disable-fail-fast |
Actionable comments posted: 0
♻️ Duplicate comments (13)
jenkins/L0_Test.groovy (13)
169-170
: Use the Jenkins sleep step, not shell sleep. Cleaner, and it shows up in Blue Ocean.

```diff
-Utils.exec(pipeline, script: "echo Sleeping to allow slurm job termination; sleep 30")
+sleep(time: 30, unit: 'SECONDS')
```
191-221
: Same: guard single-node cleanup if the job ID is missing. Avoid running Slurm commands with an empty ID.

```diff
-Utils.exec(
-    pipeline,
-    script: Utils.sshUserCmd(
-        remote,
-        "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-    )
-)
+if (slurmJobID) {
+    Utils.exec(
+        pipeline,
+        script: Utils.sshUserCmd(
+            remote,
+            "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+        )
+    )
+} else {
+    Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no job ID")
+}
```
47-49
: Use scope-specific pod timeouts; avoid a single env var overriding all. Prefer per-scope overrides with a fallback to the global value to prevent accidental cross-scope overrides.

```diff
-POD_TIMEOUT_SECONDS_TEST = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
-POD_TIMEOUT_SECONDS_BUILD = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
-POD_TIMEOUT_SECONDS_SLURM = env.podTimeoutSeconds ? env.podTimeoutSeconds : "79200" // Use 22 hours to allow for 2 hour of buffer.
+POD_TIMEOUT_SECONDS_TEST = (env.podTimeoutSecondsTest ?: env.podTimeoutSeconds ?: "21600")
+POD_TIMEOUT_SECONDS_BUILD = (env.podTimeoutSecondsBuild ?: env.podTimeoutSeconds ?: "43200")
+POD_TIMEOUT_SECONDS_SLURM = (env.podTimeoutSecondsSlurm ?: env.podTimeoutSeconds ?: "79200") // 22h with 2h buffer
```
149-160
: Broaden Slurm job ID extraction (support --parsable and other variants). Handle plain numeric output and "queued and waiting for resources" in addition to the existing patterns.

```diff
 def slurmJobID = Utils.exec(
     pipeline,
     script: Utils.sshUserCmd(
         remote,
-        "\"sed -n " +
-        "-e 's/.*Submitted batch job \\([0-9]\\+\\).*/\\1/p' " +
-        "-e 's/.*srun: job \\([0-9]\\+\\) queued.*/\\1/p' " +
-        "-e 's/.*srun: job \\([0-9]\\+\\) has been allocated.*/\\1/p' " +
-        "${slurmOutputFile} | tail -n1\""
+        "\"sed -n " +
+        "-e 's/^\\([0-9][0-9]*\\)$/\\1/p' " + /* --parsable */
+        "-e 's/.*Submitted batch job \\([0-9][0-9]*\\).*/\\1/p' " +
+        "-e 's/.*srun: job \\([0-9][0-9]*\\) queued.*/\\1/p' " +
+        "-e 's/.*srun: job \\([0-9][0-9]*\\) has been allocated.*/\\1/p' " +
+        "-e 's/.*job \\([0-9][0-9]*\\) queued and waiting for resources.*/\\1/p' " +
+        "${slurmOutputFile} | tail -n1\""
     ),
     returnStdout: true
 ).trim()
```
162-165
: Avoid Groovy String.isNumber(); use a numeric regex. This is more robust across Groovy versions and avoids NPEs.

```diff
-if (!slurmJobID || !slurmJobID.isNumber()) {
+slurmJobID = slurmJobID?.trim()
+if (!(slurmJobID ==~ /^\d+$/)) {
     Utils.exec(pipeline, script: Utils.sshUserCmd(remote, "\"cat ${slurmOutputFile}\""))
     error("Slurm job did not submit successfully. No job ID found.")
 }
```
171-177
: Guard cleanup when there is no job ID. Skip sacct/scontrol/scancel if submission failed.

```diff
-Utils.exec(
-    pipeline,
-    script: Utils.sshUserCmd(
-        remote,
-        "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
-    )
-)
+if (slurmJobID) {
+    Utils.exec(
+        pipeline,
+        script: Utils.sshUserCmd(
+            remote,
+            "\"scancel ${slurmJobID} || true; sacct -j ${slurmJobID} --format=JobID,JobName%100,Partition%15,Account%15,State,ExitCode,NodeList%30 || true; scontrol show job ${slurmJobID} || true\""
+        )
+    )
+} else {
+    Utils.exec(pipeline, script: "echo Skip SLURM cleanup: no job ID")
+}
```
302-315
: Accept --parsable output during single-node submit parsing. Add numeric-only line support.

```diff
 def jobIDs = slurmSubmitOutput
     .readLines()
     .collect { it.trim() }
     .collectMany { line ->
         def ids = []
+        if (line ==~ /^\d+$/) ids << line
         def m1 = (line =~ /Submitted batch job (\d+)/)
         if (m1) ids << m1[0][1] // Extract the first captured group
         def m2 = (line =~ /srun: job (\d+) (queued|has been allocated)/)
         if (m2) ids << m2[0][1] // Extract the first captured group
         return ids
     }
```
316-319
: Same: replace isNumber() with a regex check.

```diff
-if (!slurmJobID || !slurmJobID.isNumber()) {
+slurmJobID = slurmJobID?.trim()
+if (!(slurmJobID ==~ /^\d+$/)) {
     error("Slurm job did not submit successfully. No job ID found.\nSubmission output:\n${slurmSubmitOutput}")
 }
```
365-370
: Use the Jenkins sleep step and guard cleanup. Replace shell sleeps and skip cleanup if there is no job ID.

```diff
-stage('Clean up SLURM Resources') {
-    Utils.exec(pipeline, script: "echo Sleeping to allow docker stop; sleep 30")
-    CloudManager.destroyNode(nodeName)
-    Utils.exec(pipeline, script: "echo Sleeping to allow node destruction; sleep 30")
-    cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
-}
+stage('Clean up SLURM Resources') {
+    sleep(time: 30, unit: 'SECONDS')
+    CloudManager.destroyNode(nodeName)
+    sleep(time: 30, unit: 'SECONDS')
+    if (slurmJobID) {
+        cleanUpNodeResources(pipeline, cluster, nodeName, slurmJobID)
+    } else {
+        echo "Skip SLURM cleanup: no job ID"
+    }
+}
```
389-392
: Sanitize the BUILD_TAG-derived job UID (avoid illegal characters and length issues). Prevents slashes/spaces from breaking paths or identifiers.

```diff
-// Create a unique suffix for the job name
-String customSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll("-", "").substring(0, 6)}".toLowerCase()
-def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
+// Create a unique, sanitized suffix for the job name (lowercase, [a-z0-9-], max 63)
+def rawSuffix = "${env.BUILD_TAG}-${UUID.randomUUID().toString().replaceAll('-', '').substring(0, 6)}".toLowerCase()
+String customSuffix = rawSuffix.replaceAll('[^a-z0-9-]', '-').replaceAll('-+', '-')
+if (customSuffix.length() > 63) { customSuffix = customSuffix.substring(0, 63) }
+def jobUID = "${cluster.host}-multi_node_test-${customSuffix}"
```
494-494
: Add pipefail and quote the output path. Ensures srun failures propagate through tee and paths with spaces are safe.

```diff
-${srunCmd} 2>&1 | tee ${slurmOutputFile}
+set -o pipefail
+${srunCmd} 2>&1 | tee "${slurmOutputFile}"
```
729-729
: Quote sleep args in K8s container commands (array of strings). K8s expects strings; unquoted numbers can misrender.

```diff
-command: ['sleep', ${POD_TIMEOUT_SECONDS_SLURM}]
+command: ['sleep', '${POD_TIMEOUT_SECONDS_SLURM}']

-command: ['sleep', ${POD_TIMEOUT_SECONDS_BUILD}]
+command: ['sleep', '${POD_TIMEOUT_SECONDS_BUILD}']

-command: ['sleep', ${POD_TIMEOUT_SECONDS_TEST}]
+command: ['sleep', '${POD_TIMEOUT_SECONDS_TEST}']
```

Also applies to: 747-747, 813-813
2253-2259
: Align the retry count with the comment; llmRetry(1) likely doesn't retry twice. Make behavior match "Retry 2 times".

```diff
 // Retry 2 times if timeout occurs.
 ...
-trtllm_utils.llmRetry(1, "checkPipInstall", {
+trtllm_utils.llmRetry(2, "checkPipInstall", {
     timeout(time: 30, unit: 'MINUTES') {
         checkPipInstall(pipeline, "${cpu_arch}/${wheelPath}")
     }
 })
```
🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)
324-331
: Avoid magic numbers in the online-wait loop; derive the attempt count or make it configurable. Compute the maximum number of checks from an env var, or derive it from the duration and sleep interval.

```diff
-while (!CloudManager.isNodeOnline(nodeName) && counter < 90) {
-    // Wait 10 minutes to check status of the node again
-    sleep(time: 10, unit: 'MINUTES')
+def checkIntervalMin = 10
+def maxWaitMin = (env.SLURM_ONLINE_WAIT_MINUTES ?: "900") as Integer // default 15h
+def maxChecks = (int)(maxWaitMin / checkIntervalMin)
+while (!CloudManager.isNodeOnline(nodeName) && counter < maxChecks) {
+    sleep(time: checkIntervalMin, unit: 'MINUTES')
     counter++
 }
```
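The derivation above is just integer division of the wait budget by the polling interval; a quick shell check (the env var name mirrors the hypothetical `SLURM_ONLINE_WAIT_MINUTES` override suggested above):

```shell
# Fall back to 900 minutes (15 h) when the override is unset.
unset SLURM_ONLINE_WAIT_MINUTES
max_wait_min="${SLURM_ONLINE_WAIT_MINUTES:-900}"
check_interval_min=10
max_checks=$((max_wait_min / check_interval_min))
echo "$max_checks"   # prints 90, matching the current hard-coded limit
```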
649-657
: Prefer deleteDir() over shell rm -rf for workspace cleanup. deleteDir() is portable and reliable in Jenkins workspaces.

```diff
-// Clean up the workspace
-sh """
-    env | sort
-    pwd && ls -alh
-    rm -rf ./*
-"""
+// Clean up the workspace
+sh """
+    env | sort
+    pwd && ls -alh
+"""
+deleteDir()
```

Also applies to: 679-687
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- jenkins/Build.groovy (2 hunks)
- jenkins/L0_Test.groovy (17 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- jenkins/Build.groovy
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)
513-513
: LGTM: helpful completion log.
292-301
: Verify SlurmConfig.generateCommand signature and Jenkins.instance.rootUrl access.
SlurmConfig comes from the shared library – confirm its generateCommand method accepts (SlurmCluster, SlurmPartition, nodeSecret, nodeName, rootUrl). Also ensure Jenkins.instance.rootUrl is permitted in the pipeline sandbox (import jenkins.model.Jenkins or switch to env.JENKINS_URL if needed).
PR_Github #16900 [ run ] triggered by Bot |
/bot skip --comment "Tested a few cases including kill, abort, pass, etc." |
PR_Github #16904 [ skip ] triggered by Bot |
PR_Github #16900 [ run ] completed with state |
PR_Github #16904 [ skip ] completed with state |
Summary by CodeRabbit
Chores
Bug Fixes