-
Notifications
You must be signed in to change notification settings - Fork 680
feat: add error message propogation #4195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThis workflow update adds comprehensive error handling, logging, and GitHub annotations to the container validation pipeline. It introduces Helm availability checks, captures diagnostic information on failures, and propagates detailed error messages to GitHub via check-runs and inline annotations. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/container-validation-backends.yml (1)
584-694: Variable naming inconsistency and missing log artifact preservation in test step.Line 669 introduces
ERROR_MSG=""but line 691 usesERROR_MESSAGE=. Additionally,test-output.logis created on line 594 viatee -abut is never uploaded as a workflow artifact. This log will be lost after the job completes. Consider uploading logs as artifacts for post-job diagnostics.Standardize variable names and add log artifact uploads:
# After the test run completes, upload logs - name: Upload Test Logs if: always() uses: actions/upload-artifact@v4 with: name: test-output-logs-${{ matrix.profile }} path: test-output.log retention-days: 7
♻️ Duplicate comments (1)
.github/workflows/container-validation-backends.yml (1)
470-483: Unescaped diagnostic output in GITHUB_ENV (same issue as helm install section).Diagnostic outputs from
kubectl getcommands are concatenated intoERROR_MESSAGEwithout escaping. This is a continuation of the issue flagged in the helm install section above. Consider a more robust approach for preserving diagnostic context.
🧹 Nitpick comments (2)
.github/workflows/container-validation-backends.yml (2)
486-548: Significant code duplication between deploy-operator and test failure annotation steps.Both steps follow nearly identical patterns: extract ERROR_MESSAGE, gather diagnostics via kubectl/jq, create GitHub check-run, emit error annotation, exit. This is a maintenance liability—future changes to error handling must be duplicated across both steps.
Consider extracting into a reusable composite action (
.github/actions/create-failure-annotation/action.yml) that accepts error message, file path, and line ranges, then calls the composite from both jobs.Also applies to: 696-764
375-375: Log files created via tee but not preserved as artifacts.Both
deploy-operator.log(line 375) andtest-output.log(line 594) are valuable for post-failure diagnostics but are discarded after the job completes. Upload these as artifacts.Add upload-artifact steps in the Cleanup sections or at the end of each job:
- name: Upload Operator Deployment Logs if: always() uses: actions/upload-artifact@v4 with: name: deploy-operator-logs-${{ github.run_id }} path: deploy-operator.log retention-days: 7Also applies to: 594-594
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.github/workflows/container-validation-backends.yml(5 hunks)
🧰 Additional context used
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4195/merge) by nv-nmailhot.
.github/workflows/container-validation-backends.yml
[error] 1-1: Trailing whitespace detected and fixed by pre-commit hook. Fix reproduced by running pre-commit locally.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: operator (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (3)
.github/workflows/container-validation-backends.yml (3)
417-444: Error handling for Helm setup is well-structured.The progressive validation (availability → repo → dependencies) with early exit and message propagation is solid.
484-484: Correct error capture pattern with continue-on-error.The
continue-on-error: truecombined with conditional step (if: steps.deploy-operator-step.outcome == 'failure') properly enables error context capture without masking the failure.Also applies to: 486-488
496-515: Namespace context may be unavailable if deploy-operator fails early.The code checks
if [ -n "$NAMESPACE" ]before using it, which is good. However, if the deploy-operator step fails beforeNAMESPACEis output (line 380), the annotation step will skip diagnostic context gathering. Consider exportingNAMESPACEearlier or passing it explicitly between steps.Verify whether all code paths in the deploy-operator step establish NAMESPACE before any
exitstatement.
| # Redirect all output to a log file while still showing it | ||
| exec > >(tee -a deploy-operator.log) 2>&1 | ||
| # Set namespace using branch | ||
| BRANCH_SANITIZED="${BRANCH/\//-}" | ||
| NAMESPACE="gh-job-id-${{ github.run_id }}-${BRANCH_SANITIZED}-deploy-tests" | ||
| echo "namespace=${NAMESPACE}" >> "$GITHUB_OUTPUT" | ||
| echo "NAMESPACE=${NAMESPACE}" >> $GITHUB_ENV |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix trailing whitespace from pre-commit hook.
The pipeline failure indicates trailing whitespace exists in this section. Run pre-commit run --all-files locally to identify and fix all instances before committing.
🤖 Prompt for AI Agents
.github/workflows/container-validation-backends.yml around lines 373 to 381:
there are trailing whitespace characters in this block (the redirect and
namespace export lines) causing the pre-commit failure — remove the trailing
spaces at the end of the affected lines (ensure no extra spaces after each
line), run `pre-commit run --all-files` locally to verify the hook passes, and
commit the fixed file.
| echo "Installing dynamo-platform Helm chart..." | ||
| if ! helm upgrade --install dynamo-platform . --namespace ${NAMESPACE} \ | ||
| --set dynamo-operator.namespaceRestriction.enabled=true \ | ||
| --set dynamo-operator.namespaceRestriction.allowedNamespaces[0]=${NAMESPACE} \ | ||
| --set dynamo-operator.controllerManager.manager.image.repository=${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo \ | ||
| --set dynamo-operator.controllerManager.manager.image.tag=${{ github.sha }}-operator-amd64 \ | ||
| --set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret | ||
| --set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret 2>&1; then | ||
| ERROR_MSG="Failed to install dynamo-platform Helm chart. This may be due to: pre-install hook timeout, image pull failures, or resource constraints." | ||
| echo "$ERROR_MSG" | ||
| # Capture additional diagnostics | ||
| echo "=== Pod Status ===" | ||
| kubectl get pods -n ${NAMESPACE} -o wide || true | ||
| echo "=== Events ===" | ||
| kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp' | tail -20 || true | ||
| echo "=== Helm Status ===" | ||
| helm status dynamo-platform -n ${NAMESPACE} || true | ||
| echo "ERROR_MESSAGE=$ERROR_MSG" >> $GITHUB_ENV | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable naming inconsistency and potential unsafe output concatenation.
Line 454 uses ERROR_MSG instead of ERROR_MESSAGE (used elsewhere). More critically, diagnostic outputs (pods, events) are concatenated into ERROR_MESSAGE without sanitization before being stored in GITHUB_ENV. If any output contains newlines or special characters, this could corrupt the environment variable.
Standardize the variable name and escape diagnostic output:
if ! helm upgrade --install dynamo-platform . --namespace ${NAMESPACE} \
--set dynamo-operator.namespaceRestriction.enabled=true \
--set dynamo-operator.namespaceRestriction.allowedNamespaces[0]=${NAMESPACE} \
--set dynamo-operator.controllerManager.manager.image.repository=${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo \
--set dynamo-operator.controllerManager.manager.image.tag=${{ github.sha }}-operator-amd64 \
--set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret 2>&1; then
- ERROR_MSG="Failed to install dynamo-platform Helm chart. This may be due to: pre-install hook timeout, image pull failures, or resource constraints."
+ ERROR_MESSAGE="Failed to install dynamo-platform Helm chart. This may be due to: pre-install hook timeout, image pull failures, or resource constraints."
echo "$ERROR_MSG"For the diagnostic capture, consider logging to a file artifact instead of concatenating into a single environment variable.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| echo "Installing dynamo-platform Helm chart..." | |
| if ! helm upgrade --install dynamo-platform . --namespace ${NAMESPACE} \ | |
| --set dynamo-operator.namespaceRestriction.enabled=true \ | |
| --set dynamo-operator.namespaceRestriction.allowedNamespaces[0]=${NAMESPACE} \ | |
| --set dynamo-operator.controllerManager.manager.image.repository=${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo \ | |
| --set dynamo-operator.controllerManager.manager.image.tag=${{ github.sha }}-operator-amd64 \ | |
| --set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret | |
| --set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret 2>&1; then | |
| ERROR_MSG="Failed to install dynamo-platform Helm chart. This may be due to: pre-install hook timeout, image pull failures, or resource constraints." | |
| echo "$ERROR_MSG" | |
| # Capture additional diagnostics | |
| echo "=== Pod Status ===" | |
| kubectl get pods -n ${NAMESPACE} -o wide || true | |
| echo "=== Events ===" | |
| kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp' | tail -20 || true | |
| echo "=== Helm Status ===" | |
| helm status dynamo-platform -n ${NAMESPACE} || true | |
| echo "ERROR_MESSAGE=$ERROR_MSG" >> $GITHUB_ENV | |
| exit 1 | |
| fi | |
| echo "Installing dynamo-platform Helm chart..." | |
| if ! helm upgrade --install dynamo-platform . --namespace ${NAMESPACE} \ | |
| --set dynamo-operator.namespaceRestriction.enabled=true \ | |
| --set dynamo-operator.namespaceRestriction.allowedNamespaces[0]=${NAMESPACE} \ | |
| --set dynamo-operator.controllerManager.manager.image.repository=${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo \ | |
| --set dynamo-operator.controllerManager.manager.image.tag=${{ github.sha }}-operator-amd64 \ | |
| --set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret 2>&1; then | |
| ERROR_MESSAGE="Failed to install dynamo-platform Helm chart. This may be due to: pre-install hook timeout, image pull failures, or resource constraints." | |
| echo "$ERROR_MESSAGE" | |
| # Capture additional diagnostics | |
| echo "=== Pod Status ===" | |
| kubectl get pods -n ${NAMESPACE} -o wide || true | |
| echo "=== Events ===" | |
| kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp' | tail -20 || true | |
| echo "=== Helm Status ===" | |
| helm status dynamo-platform -n ${NAMESPACE} || true | |
| echo "ERROR_MESSAGE=$ERROR_MESSAGE" >> $GITHUB_ENV | |
| exit 1 | |
| fi |
🤖 Prompt for AI Agents
In .github/workflows/container-validation-backends.yml around lines 447 to 467,
replace the inconsistent ERROR_MSG with ERROR_MESSAGE and stop writing raw
multi-line diagnostic output into GITHUB_ENV; instead capture the
pod/events/helm outputs to separate files (or a single diagnostics file) and
upload them as workflow artifacts (or base64-encode/sanitize before exporting)
while only writing a single-line, safe ERROR_MESSAGE to GITHUB_ENV (e.g.,
"Failed to install dynamo-platform; diagnostics saved to <path>" or the
base64-encoded diagnostics string) to avoid corrupting the environment file.
| POD_STATUS=$(kubectl get pods -n $NAMESPACE -o json 2>/dev/null || echo "{}") | ||
| POD_ERRORS=$(echo "$POD_STATUS" | jq -r '.items[] | select(.status.phase != "Running") | "Pod: \(.metadata.name), Status: \(.status.phase), Reason: \(.status.containerStatuses[0].state.waiting.reason // .status.reason // "N/A"), Message: \(.status.containerStatuses[0].state.waiting.message // "N/A")"' 2>/dev/null | head -10 || echo "") | ||
| if [ -n "$POD_ERRORS" ]; then | ||
| ERROR_MESSAGE="$ERROR_MESSAGE\n\nPod Status:\n$POD_ERRORS" | ||
| fi | ||
| # Get recent events | ||
| EVENTS=$(kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' -o json 2>/dev/null || echo "{}") | ||
| ERROR_EVENTS=$(echo "$EVENTS" | jq -r '.items[] | select(.type == "Warning" or .type == "Error") | "\(.lastTimestamp) - \(.reason): \(.message)"' 2>/dev/null | tail -5 || echo "") | ||
| if [ -n "$ERROR_EVENTS" ]; then | ||
| ERROR_MESSAGE="$ERROR_MESSAGE\n\nRecent Events:\n$ERROR_EVENTS" | ||
| fi | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Diagnostic outputs appended to ERROR_MESSAGE without escaping.
Pod and event information are concatenated into ERROR_MESSAGE (lines 505, 513) without escaping newlines or special characters. This exacerbates the JSON escaping issue flagged earlier. When ERROR_MESSAGE is later embedded in the GitHub API payload, these unescaped values will corrupt the JSON.
Consider building a structured diagnostic object separately or properly escaping before concatenation.
| # Create a check run with the annotation | ||
| CHECK_RUN_ID=$(curl -s -X POST \ | ||
| -H "Authorization: token $GITHUB_TOKEN" \ | ||
| -H "Accept: application/vnd.github.v3+json" \ | ||
| "https://api.github.com/repos/${{ github.repository }}/check-runs" \ | ||
| -d '{ | ||
| "name": "Deploy Operator", | ||
| "head_sha": "${{ github.sha }}", | ||
| "status": "completed", | ||
| "conclusion": "failure", | ||
| "output": { | ||
| "title": "Operator Deployment Failed", | ||
| "summary": "Failed to deploy dynamo-platform operator to namespace '"${NAMESPACE}"'", | ||
| "text": "**Job**: deploy-operator\n**Namespace**: '"${NAMESPACE}"'\n\n**Error Details**:\n```\n'"${ERROR_MESSAGE}"'\n```\n\n[View Job Run](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})", | ||
| "annotations": [{ | ||
| "path": ".github/workflows/container-validation-backends.yml", | ||
| "start_line": 357, | ||
| "end_line": 425, | ||
| "annotation_level": "failure", | ||
| "message": "'"${ERROR_MESSAGE}"'", | ||
| "title": "Operator deployment failed" | ||
| }] | ||
| } | ||
| }' | jq -r '.id') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Unescaped ERROR_MESSAGE in JSON payload; hard-coded line numbers will become stale.
Line 536 interpolates ERROR_MESSAGE directly into JSON without escaping quotes or newlines. If the error message contains " or newline characters, the JSON will be malformed. Additionally, hard-coded line numbers (357, 425 at lines 533-534) are brittle; they must be updated whenever the file structure changes.
Apply proper JSON escaping and use dynamic line references:
"annotations": [{
"path": ".github/workflows/container-validation-backends.yml",
- "start_line": 357,
- "end_line": 425,
+ "start_line": ${{ steps.deploy-operator-step.step_id_line_start || 357 }},
+ "end_line": ${{ steps.deploy-operator-step.step_id_line_end || 425 }},
"annotation_level": "failure",
- "message": "'"${ERROR_MESSAGE}"'",
+ "message": "$(printf '%s\n' "${ERROR_MESSAGE}" | jq -Rs .)",
"title": "Operator deployment failed"
}]Better yet, use a shell utility to properly escape JSON:
ERROR_MESSAGE_JSON=$(printf '%s\n' "${ERROR_MESSAGE}" | jq -Rs .)
# Then embed ERROR_MESSAGE_JSON in the payload| cat > annotation.json <<EOF | ||
| { | ||
| "title": "Deployment Test Failed: ${{ env.FRAMEWORK }} (${{ matrix.profile }})", | ||
| "summary": "Deployment test failed for ${{ env.FRAMEWORK }} with profile ${{ matrix.profile }}", | ||
| "text": "**Job**: ${{ github.job }}\n**Framework**: ${{ env.FRAMEWORK }}\n**Profile**: ${{ matrix.profile }}\n**Namespace**: ${{ needs.deploy-operator.outputs.NAMESPACE }}\n\n**Error Details**:\n\`\`\`\n${ERROR_MESSAGE}\n\`\`\`\n\n[View Job Run](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})", | ||
| "annotation_level": "failure", | ||
| "file": ".github/workflows/container-validation-backends.yml", | ||
| "start_line": 426, | ||
| "end_line": 618 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused annotation.json file; hard-coded line numbers.
Lines 721-730 construct an annotation JSON file, but the subsequent API call (lines 734-756) uses an inline JSON payload instead. The annotation.json file is never used. Remove the dead code. Additionally, line numbers 426, 618 at lines 728-729 and 749-750 are hard-coded and will become stale.
Remove unused annotation.json construction and either:
- Use the file if intended:
cat annotation.json | jq - Or remove it entirely and rely on the inline payload
Also address the hard-coded line numbers as described in the deploy-operator annotation comment above.
🤖 Prompt for AI Agents
.github/workflows/container-validation-backends.yml lines 721-730: the script
builds an unused annotation.json and embeds hard-coded start_line/end_line
values; remove the dead cat > annotation.json block entirely (or if you intended
to use a file, replace the inline payload usage with piping annotation.json
through jq to the API), and eliminate or compute the hard-coded
"start_line"/"end_line" values (either remove those fields or derive them
dynamically from the job/context rather than using static numbers) so the
workflow no longer produces unused artifacts or stale line-number metadata.
Overview:
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit