[None][infra] Fix slurm exitcode #8585

EmmaQiaoCh · 2025-10-22T13:33:23Z

Summary by CodeRabbit

Chores
- Enhanced test infrastructure with updated timeout configuration and improved error handling for test execution.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions).
- Any new dependencies have been scanned for license and vulnerabilities.
- CODEOWNERS updated if ownership changes.
- Documentation updated as needed.
- The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Signed-off-by: qqiao <[email protected]>

EmmaQiaoCh · 2025-10-22T13:34:28Z

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1"

coderabbitai · 2025-10-22T13:37:29Z

Walkthrough

This change reduces the default pytest timeout from 3600 to 600 seconds in the Jenkins test configuration and ensures the slurm_run.sh script properly captures and propagates the pytest command exit code with completion logging.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Test timeout configuration `jenkins/L0_Test.groovy` | Reduced default pytest timeout from 3600 to 600 seconds; CPP test timeout override updated accordingly |
| Exit code propagation `jenkins/scripts/slurm_run.sh` | Added exit code capture after pytest command execution, logs rank-specific completion message, and propagates exit status via `exit $returnCode` |
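The capture-and-propagate pattern summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `jenkins/scripts/slurm_run.sh`: the `pytestCommand` value is a hypothetical stand-in that always fails with status 3, `run_tests` is a name introduced only for this example, and the fallback rank `0` is used only so the sketch runs outside Slurm.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real pytest invocation; it exits with status 3.
pytestCommand="sh -c 'exit 3'"

run_tests() {
    eval "$pytestCommand"
    # Save the status immediately: the echo below would otherwise reset $? to 0.
    local returnCode=$?
    # SLURM_PROCID is set by Slurm; default to 0 here (assumption for local runs).
    echo "Rank${SLURM_PROCID:-0} Pytest finished execution"
    # Report the saved status instead of the status of the echo.
    return "$returnCode"
}

run_tests
status=$?
echo "run_tests returned $status"
```

In a standalone script the final step would be `exit "$returnCode"` rather than `return`, which is the form the change summary describes.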

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description Check | ⚠️ Warning | The PR description is largely incomplete and does not meet the repository's template requirements. The author submitted only the template structure with all critical content sections left empty: the "Description" section contains no explanation of the issue or solution, the "Test Coverage" section contains no information about relevant tests, and while the PR Checklist is marked as complete with [x], no verification details or context is provided. The description does not convey what changes are being made, why they are necessary, or how they are tested. | Please provide a complete PR description by filling in the "Description" section with an explanation of the issue being fixed and the solution implemented. Add the "Test Coverage" section describing what tests validate these changes (specifically, tests that verify the exit code is now properly propagated and the timeout reduction in L0_Test.groovy). Additionally, explain which PR Checklist items apply to this change and confirm that they have been addressed. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The PR title "[None][infra] Fix slurm exitcode" follows the required template format with a valid ticket format ([None]), a valid type (infra), and a clear summary. The title directly corresponds to the main change in the pull request, which is fixing exit code handling in the slurm script (jenkins/scripts/slurm_run.sh). The title is concise, specific enough to convey the primary change, and a teammate scanning the history would understand the intent. While a secondary change reduces the pytest timeout in L0_Test.groovy, the primary focus of the PR is the exit code fix, which is accurately reflected in the title. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

jenkins/scripts/slurm_run.sh (1)
107-109: LGTM with a minor suggestion for improved logging.

The exit code capture is correct and ensures the pytest result is properly preserved. Consider including the exit code in the completion log for better debugging visibility.

Apply this diff to include the exit code in the log message:
```diff
 eval $pytestCommand
 returnCode=$?
-echo "Rank${SLURM_PROCID} Pytest finished execution"
+echo "Rank${SLURM_PROCID} Pytest finished execution with exit code: $returnCode"
```

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 879039f and 64ba6ef.

📒 Files selected for processing (2)

jenkins/L0_Test.groovy (1 hunks)
jenkins/scripts/slurm_run.sh (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (1)

jenkins/scripts/slurm_run.sh (1)

128-128: Excellent fix for exit code propagation.

This ensures the script exits with the pytest command's status, even when performance checks run afterwards. Without this, the exit code would reflect the last executed command (perf checks), potentially masking pytest failures. This directly addresses the PR objective "Fix slurm exitcode."
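The masking behavior described above is easy to reproduce in isolation. In this sketch, `false` stands in for a failing pytest run and `true` for a later step such as a perf check; both are placeholders chosen for illustration, not the script's actual commands.

```shell
#!/usr/bin/env bash
# Without an explicit exit, the subshell's status is that of its last command,
# so the earlier failure is hidden behind the succeeding follow-up step.
( false; true )
masked=$?

# Saving the status right after the failing step and exiting with it
# keeps the failure visible to the caller.
( false; rc=$?; true; exit "$rc" )
propagated=$?

echo "masked=$masked propagated=$propagated"   # prints "masked=0 propagated=1"
```

A caller that inspects only the exit status, such as a CI job step, would treat the first form as a success.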

jenkins/L0_Test.groovy

tensorrt-cicd · 2025-10-22T13:40:04Z

PR_Github #22187 [ run ] triggered by Bot. Commit: 64ba6ef

tensorrt-cicd · 2025-10-22T15:32:11Z

PR_Github #22187 [ run ] completed with state SUCCESS. Commit: 64ba6ef
/LLM/main/L0_MergeRequest_PR pipeline #16729 (Partly Tested) completed with status: 'SUCCESS'

Signed-off-by: qqiao <[email protected]>

EmmaQiaoCh · 2025-10-23T00:54:23Z

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1"

tensorrt-cicd · 2025-10-23T01:00:17Z

PR_Github #22222 [ run ] triggered by Bot. Commit: 088d2c5

tensorrt-cicd · 2025-10-23T02:56:24Z

PR_Github #22222 [ run ] completed with state FAILURE. Commit: 088d2c5
/LLM/main/L0_MergeRequest_PR pipeline #16755 (Partly Tested) completed with status: 'FAILURE'

Signed-off-by: Emma Qiao <[email protected]>

EmmaQiaoCh · 2025-10-23T03:42:24Z

/bot skip --comment "Tested one gb200 slurm job"

tensorrt-cicd · 2025-10-23T03:48:01Z

PR_Github #22237 [ skip ] triggered by Bot. Commit: 5f72104

jenkins/scripts/slurm_run.sh

tensorrt-cicd · 2025-10-23T04:19:11Z

PR_Github #22237 [ skip ] completed with state SUCCESS. Commit: 5f72104
Skipping testing for commit 5f72104

Added a log message to indicate completion of pytest execution. Signed-off-by: Emma Qiao <[email protected]>

EmmaQiaoCh · 2025-10-23T05:50:28Z

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1"

tensorrt-cicd · 2025-10-23T05:56:03Z

PR_Github #22246 [ run ] triggered by Bot. Commit: 7864d20

tensorrt-cicd · 2025-10-23T08:14:51Z

PR_Github #22246 [ run ] completed with state FAILURE. Commit: 7864d20
/LLM/main/L0_MergeRequest_PR pipeline #16772 (Partly Tested) completed with status: 'FAILURE'

EmmaQiaoCh added 2 commits October 22, 2025 06:31

Return error when pytest command fail

5533726

Signed-off-by: qqiao <[email protected]>

Update for testing

64ba6ef

Signed-off-by: qqiao <[email protected]>

EmmaQiaoCh requested review from a team as code owners October 22, 2025 13:33

EmmaQiaoCh requested review from niukuo and zeroepoch October 22, 2025 13:33

coderabbitai bot reviewed Oct 22, 2025

jenkins/L0_Test.groovy (outdated, resolved)

zeroepoch approved these changes Oct 22, 2025


Remove some scripts

088d2c5

Signed-off-by: qqiao <[email protected]>

Change back timeout for test_nvfp4 latency_moe_trtllm

5f72104

Signed-off-by: Emma Qiao <[email protected]>

chzblych approved these changes Oct 23, 2025

jenkins/scripts/slurm_run.sh (resolved)

Log pytest completion with SLURM process ID

7864d20

Added a log message to indicate completion of pytest execution. Signed-off-by: Emma Qiao <[email protected]>


4 participants
