@EmmaQiaoCh (Collaborator) commented Oct 22, 2025

Summary by CodeRabbit

  • Chores
    • Enhanced test infrastructure with updated timeout configuration and improved error handling for test execution.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@EmmaQiaoCh EmmaQiaoCh requested review from a team as code owners October 22, 2025 13:33
@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1"

@coderabbitai
Contributor

coderabbitai bot commented Oct 22, 2025

📝 Walkthrough

This change reduces the default pytest timeout from 3600 to 600 seconds in the Jenkins test configuration and ensures the slurm_run.sh script properly captures and propagates the pytest command exit code with completion logging.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Test timeout configuration: jenkins/L0_Test.groovy | Reduced default pytest timeout from 3600 to 600 seconds; CPP test timeout override updated accordingly |
| Exit code propagation: jenkins/scripts/slurm_run.sh | Added exit code capture after pytest command execution, logs rank-specific completion message, and propagates exit status via exit $returnCode |
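
The capture step summarized above can be sketched as follows. This is a minimal illustration rather than the contents of jenkins/scripts/slurm_run.sh: the `pytestCommand` value is a hypothetical stand-in that always fails with status 3, `statusAfterEcho` exists only to show why `$?` has to be saved before the log line runs, and the `set +e` line assumes the surrounding environment might have errexit enabled.

```shell
# Assumption: errexit may be on in the calling environment. Turn it off so a
# failing test command does not abort the script before its status is saved.
set +e

# Hypothetical stand-in for the real pytest invocation; it fails with status 3.
pytestCommand="sh -c 'exit 3'"

eval $pytestCommand
returnCode=$?          # capture immediately: the next command overwrites $?
echo "Rank${SLURM_PROCID:-0} Pytest finished execution"
statusAfterEcho=$?     # status of echo, not of the test command

echo "captured=$returnCode after_echo=$statusAfterEcho"
# prints "captured=3 after_echo=0"
```

Reading `$?` after the log line would report the status of `echo`, which is why the assignment sits directly after `eval`.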

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description Check | ⚠️ Warning | The PR description is largely incomplete and does not meet the repository's template requirements. The author submitted only the template structure with all critical content sections left empty: the "Description" section contains no explanation of the issue or solution, the "Test Coverage" section contains no information about relevant tests, and while the PR Checklist is marked as complete with [x], no verification details or context is provided. The description does not convey what changes are being made, why they are necessary, or how they are tested. | Please provide a complete PR description by filling in the "Description" section with an explanation of the issue being fixed and the solution implemented. Add the "Test Coverage" section describing what tests validate these changes (specifically, tests that verify the exit code is now properly propagated and the timeout reduction in L0_Test.groovy). Additionally, explain which PR Checklist items apply to this change and confirm that they have been addressed. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The PR title "[None][infra] Fix slurm exitcode" follows the required template format with a valid ticket format ([None]), a valid type (infra), and a clear summary. The title directly corresponds to the main change in the pull request, which is fixing exit code handling in the slurm script (jenkins/scripts/slurm_run.sh). The title is concise, specific enough to convey the primary change, and a teammate scanning the history would understand the intent. While a secondary change reduces the pytest timeout in L0_Test.groovy, the primary focus of the PR is the exit code fix, which is accurately reflected in the title. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
jenkins/scripts/slurm_run.sh (1)

107-109: LGTM with a minor suggestion for improved logging.

The exit code capture is correct and ensures the pytest result is properly preserved. Consider including the exit code in the completion log for better debugging visibility.

Apply this diff to include the exit code in the log message:

 eval $pytestCommand
 returnCode=$?
-echo "Rank${SLURM_PROCID} Pytest finished execution"
+echo "Rank${SLURM_PROCID} Pytest finished execution with exit code: $returnCode"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 879039f and 64ba6ef.

📒 Files selected for processing (2)
  • jenkins/L0_Test.groovy (1 hunks)
  • jenkins/scripts/slurm_run.sh (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/scripts/slurm_run.sh (1)

128-128: Excellent fix for exit code propagation.

This ensures the script exits with the pytest command's status, even when performance checks run afterwards. Without this, the exit code would reflect the last executed command (perf checks), potentially masking pytest failures. This directly addresses the PR objective "Fix slurm exitcode."
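
The masking effect described in this comment can be reproduced with a small sketch. The functions `run_tests` and `perf_check` are hypothetical stand-ins (a failing test step and a succeeding later step), not code from slurm_run.sh, and each variant runs in a subshell so its final status can be compared. The `set +e` line assumes errexit might be enabled by the caller.

```shell
# Assumption: errexit may be on in the calling environment; disable it so the
# failing stand-in does not abort the sketch.
set +e

# Hypothetical stand-ins: the test step fails, the later perf check succeeds.
run_tests()  { return 1; }
perf_check() { return 0; }

# Variant A: no explicit exit, so the subshell ends with perf_check's status.
( run_tests; perf_check )
maskedStatus=$?

# Variant B: save the test status and exit with it after the later step.
( run_tests; returnCode=$?; perf_check; exit $returnCode )
propagatedStatus=$?

echo "masked=$maskedStatus propagated=$propagatedStatus"
# prints "masked=0 propagated=1"
```

Variant A reports success even though the test step failed. Variant B surfaces the failure to whatever launched the script, which is the behavior the review attributes to the added `exit $returnCode`.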

@tensorrt-cicd
Collaborator

PR_Github #22187 [ run ] triggered by Bot. Commit: 64ba6ef

@tensorrt-cicd
Collaborator

PR_Github #22187 [ run ] completed with state SUCCESS. Commit: 64ba6ef
/LLM/main/L0_MergeRequest_PR pipeline #16729 (Partly Tested) completed with status: 'SUCCESS'

Signed-off-by: qqiao <[email protected]>
@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #22222 [ run ] triggered by Bot. Commit: 088d2c5

@tensorrt-cicd
Collaborator

PR_Github #22222 [ run ] completed with state FAILURE. Commit: 088d2c5
/LLM/main/L0_MergeRequest_PR pipeline #16755 (Partly Tested) completed with status: 'FAILURE'

@EmmaQiaoCh
Collaborator Author

/bot skip --comment "Tested one gb200 slurm job"

@tensorrt-cicd
Collaborator

PR_Github #22237 [ skip ] triggered by Bot. Commit: 5f72104

@tensorrt-cicd
Collaborator

PR_Github #22237 [ skip ] completed with state SUCCESS. Commit: 5f72104
Skipping testing for commit 5f72104

Added a log message to indicate completion of pytest execution.

Signed-off-by: Emma Qiao <[email protected]>
@EmmaQiaoCh
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #22246 [ run ] triggered by Bot. Commit: 7864d20

@tensorrt-cicd
Collaborator

PR_Github #22246 [ run ] completed with state FAILURE. Commit: 7864d20
/LLM/main/L0_MergeRequest_PR pipeline #16772 (Partly Tested) completed with status: 'FAILURE'

4 participants