
Conversation


@nv-nmailhot nv-nmailhot commented Nov 7, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Chores
    • Enhanced CI/CD workflow robustness with improved error detection, logging, and diagnostic capture capabilities. Deployment failures now generate detailed annotations and diagnostic information for faster issue resolution.

@nv-nmailhot nv-nmailhot requested a review from a team as a code owner November 7, 2025 23:10
@github-actions github-actions bot added the feat label Nov 7, 2025

coderabbitai bot commented Nov 7, 2025

Walkthrough

This workflow update adds comprehensive error handling, logging, and GitHub annotations to the container validation pipeline. It introduces Helm availability checks, captures diagnostic information on failures, and propagates detailed error messages to GitHub via check-runs and inline annotations.

Changes

All changes are in .github/workflows/container-validation-backends.yml.

  • Logging & Output Capture: Added output redirection to deploy-operator.log and test-output.log for structured logging while maintaining console output.
  • Helm Verification & Setup: Introduced Helm availability checks, repository addition with error handling, and dependency build verification with failure detection.
  • Deployment Error Handling: Wrapped Helm chart install and rollout operations in conditionals; on failure, captures pod status, events, deployments, and Helm status as diagnostics.
  • Test Execution & Validation: Enhanced the vLLM deploy-test flow with error messaging, diagnostic collection, and JSON validation with specific error categorization.
  • GitHub Annotations & Error Reporting: Added check-run creation via the GitHub API and inline error annotations with detailed payloads including file location, line numbers, and error context.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas requiring extra attention:
    • GitHub API check-run payload structure and annotation formatting to ensure compatibility with GitHub Actions
    • Error message propagation across multiple conditional branches to verify consistency and correctness
    • Diagnostic capture logic (pod status, events, deployments) for proper shell command execution and error handling
    • Helm command chaining and error state management through the deployment pipeline

Poem

🐰 With logs now flowing like morning dew,
And Helm checks catching issues too,
Errors annotated in GitHub's view,
Diagnostics captured, debugging's through!
This workflow's robust, tested and true! 🛡️✨

Pre-merge checks

❌ Failed checks (1 warning)
  • Description check (⚠️ Warning): The PR description contains only template placeholders, with no concrete information about the changes, objectives, or related issues filled in. Resolution: fill in the template sections with specific details about the error-message-propagation implementation and the affected files, and link the related GitHub issue instead of using the placeholder #xxx.

✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title refers to error message propagation, which matches the AI-generated summary's core focus on adding error handling and context propagation throughout the workflow.
  • Docstring Coverage (✅ Passed): No functions were found in the changed files to evaluate, so the docstring coverage check was skipped.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/container-validation-backends.yml (1)

584-694: Variable naming inconsistency and missing log artifact preservation in test step.

Line 669 introduces ERROR_MSG="" but line 691 uses ERROR_MESSAGE=. Additionally, test-output.log is created on line 594 via tee -a but is never uploaded as a workflow artifact. This log will be lost after the job completes. Consider uploading logs as artifacts for post-job diagnostics.

Standardize variable names and add log artifact uploads:

# After the test run completes, upload logs
- name: Upload Test Logs
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: test-output-logs-${{ matrix.profile }}
    path: test-output.log
    retention-days: 7
♻️ Duplicate comments (1)
.github/workflows/container-validation-backends.yml (1)

470-483: Unescaped diagnostic output in GITHUB_ENV (same issue as helm install section).

Diagnostic outputs from kubectl get commands are concatenated into ERROR_MESSAGE without escaping. This is a continuation of the issue flagged in the helm install section above. Consider a more robust approach for preserving diagnostic context.
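
One option, assuming later steps only need to read ERROR_MESSAGE back from the environment, is GitHub Actions' documented heredoc syntax for multi-line values in $GITHUB_ENV, which preserves newlines without corrupting the file. A minimal sketch:

DELIM="EOF_DIAG_$$"   # delimiter chosen to be unlikely to appear in the diagnostics
{
  echo "ERROR_MESSAGE<<${DELIM}"
  echo "$ERROR_MESSAGE"
  echo "--- Pod Status ---"
  kubectl get pods -n "${NAMESPACE}" -o wide 2>&1 | head -10 || true
  echo "${DELIM}"
} >> "$GITHUB_ENV"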

🧹 Nitpick comments (2)
.github/workflows/container-validation-backends.yml (2)

486-548: Significant code duplication between deploy-operator and test failure annotation steps.

Both steps follow nearly identical patterns: extract ERROR_MESSAGE, gather diagnostics via kubectl/jq, create GitHub check-run, emit error annotation, exit. This is a maintenance liability—future changes to error handling must be duplicated across both steps.

Consider extracting into a reusable composite action (.github/actions/create-failure-annotation/action.yml) that accepts error message, file path, and line ranges, then calls the composite from both jobs.

Also applies to: 696-764
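
A minimal sketch of such a composite action, with hypothetical path and input names (jq builds the payload, so quoting is handled in one place):

# .github/actions/create-failure-annotation/action.yml  (hypothetical)
name: create-failure-annotation
description: Create a failed check-run with a single inline annotation
inputs:
  check-name:
    description: Name for the check run
    required: true
  message:
    description: Error text for the annotation
    required: true
  file:
    description: Path of the file to annotate
    required: true
  start-line:
    description: First annotated line
    required: true
  end-line:
    description: Last annotated line
    required: true
runs:
  using: composite
  steps:
    - shell: bash
      env:
        # Pass inputs through env so no expression interpolation touches the JSON
        CHECK_NAME: ${{ inputs.check-name }}
        MESSAGE: ${{ inputs.message }}
        FILE: ${{ inputs.file }}
        START_LINE: ${{ inputs.start-line }}
        END_LINE: ${{ inputs.end-line }}
        GH_TOKEN: ${{ github.token }}
      run: |
        jq -n --arg name "$CHECK_NAME" --arg msg "$MESSAGE" --arg file "$FILE" \
              --argjson start "$START_LINE" --argjson end "$END_LINE" --arg sha "$GITHUB_SHA" \
          '{name: $name, head_sha: $sha, status: "completed", conclusion: "failure",
            output: {title: $name, summary: $msg,
                     annotations: [{path: $file, start_line: $start, end_line: $end,
                                    annotation_level: "failure", message: $msg}]}}' \
        | curl -s -X POST -H "Authorization: token $GH_TOKEN" \
               -H "Accept: application/vnd.github.v3+json" \
               -d @- "https://api.github.com/repos/$GITHUB_REPOSITORY/check-runs"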


375-375: Log files created via tee but not preserved as artifacts.

Both deploy-operator.log (line 375) and test-output.log (line 594) are valuable for post-failure diagnostics but are discarded after the job completes. Upload these as artifacts.

Add upload-artifact steps in the Cleanup sections or at the end of each job:

- name: Upload Operator Deployment Logs
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: deploy-operator-logs-${{ github.run_id }}
    path: deploy-operator.log
    retention-days: 7

Also applies to: 594-594

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d1cf3c2 and 959ec2b.

📒 Files selected for processing (1)
  • .github/workflows/container-validation-backends.yml (5 hunks)
🧰 Additional context used
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4195/merge) by nv-nmailhot.
.github/workflows/container-validation-backends.yml

[error] 1-1: Trailing whitespace was detected and fixed by the pre-commit hook. The fix can be reproduced by running pre-commit locally.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: operator (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (3)
.github/workflows/container-validation-backends.yml (3)

417-444: Error handling for Helm setup is well-structured.

The progressive validation (availability → repo → dependencies) with early exit and message propagation is solid.
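
For readers skimming the diff, the pattern reduces to a fail-fast chain along these lines (a condensed sketch, not the exact PR code; the helper name and repository URL are placeholders):

fail() { echo "$1"; echo "ERROR_MESSAGE=$1" >> "$GITHUB_ENV"; exit 1; }

command -v helm >/dev/null 2>&1 \
  || fail "helm is not available on the runner"
helm repo add example-charts https://example.com/charts \
  || fail "Failed to add Helm repository"
helm dependency build . \
  || fail "Failed to build Helm chart dependencies"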


484-484: Correct error capture pattern with continue-on-error.

The continue-on-error: true combined with conditional step (if: steps.deploy-operator-step.outcome == 'failure') properly enables error context capture without masking the failure.

Also applies to: 486-488
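
For reference, the general shape of this pattern (step names and the deploy script are placeholders):

- name: Deploy Operator
  id: deploy-operator-step
  continue-on-error: true   # job keeps running so the reporting step can execute
  run: ./deploy.sh          # placeholder for the real deploy commands

- name: Report Deployment Failure
  if: steps.deploy-operator-step.outcome == 'failure'
  run: |
    echo "::error::Operator deployment failed: ${ERROR_MESSAGE}"
    exit 1                  # re-fail the job after reporting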


496-515: Namespace context may be unavailable if deploy-operator fails early.

The code checks if [ -n "$NAMESPACE" ] before using it, which is good. However, if the deploy-operator step fails before NAMESPACE is output (line 380), the annotation step will skip diagnostic context gathering. Consider exporting NAMESPACE earlier or passing it explicitly between steps.

Verify whether all code paths in the deploy-operator step establish NAMESPACE before any exit statement.
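
One way to guarantee that is to compute the namespace in its own step before any fallible command runs (a sketch; the step ids and deploy script are illustrative, and BRANCH is assumed to be set in the job env as in the current workflow):

- name: Compute Namespace
  id: ns
  run: |
    BRANCH_SANITIZED="${BRANCH/\//-}"
    echo "namespace=gh-job-id-${{ github.run_id }}-${BRANCH_SANITIZED}-deploy-tests" >> "$GITHUB_OUTPUT"

- name: Deploy Operator
  id: deploy-operator-step
  env:
    NAMESPACE: ${{ steps.ns.outputs.namespace }}  # available on every exit path
  run: ./deploy.sh   # placeholder for the real deploy commands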

Comment on lines +373 to +381
# Redirect all output to a log file while still showing it
exec > >(tee -a deploy-operator.log) 2>&1
# Set namespace using branch
BRANCH_SANITIZED="${BRANCH/\//-}"
NAMESPACE="gh-job-id-${{ github.run_id }}-${BRANCH_SANITIZED}-deploy-tests"
echo "namespace=${NAMESPACE}" >> "$GITHUB_OUTPUT"
echo "NAMESPACE=${NAMESPACE}" >> $GITHUB_ENV

⚠️ Potential issue | 🟡 Minor

Fix trailing whitespace from pre-commit hook.

The pipeline failure indicates trailing whitespace exists in this section. Run pre-commit run --all-files locally to identify and fix all instances before committing.

🤖 Prompt for AI Agents
.github/workflows/container-validation-backends.yml around lines 373 to 381:
there are trailing whitespace characters in this block (the redirect and
namespace export lines) causing the pre-commit failure — remove the trailing
spaces at the end of the affected lines (ensure no extra spaces after each
line), run `pre-commit run --all-files` locally to verify the hook passes, and
commit the fixed file.

Comment on lines +447 to +467
echo "Installing dynamo-platform Helm chart..."
if ! helm upgrade --install dynamo-platform . --namespace ${NAMESPACE} \
--set dynamo-operator.namespaceRestriction.enabled=true \
--set dynamo-operator.namespaceRestriction.allowedNamespaces[0]=${NAMESPACE} \
--set dynamo-operator.controllerManager.manager.image.repository=${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo \
--set dynamo-operator.controllerManager.manager.image.tag=${{ github.sha }}-operator-amd64 \
--set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret
--set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret 2>&1; then
ERROR_MSG="Failed to install dynamo-platform Helm chart. This may be due to: pre-install hook timeout, image pull failures, or resource constraints."
echo "$ERROR_MSG"
# Capture additional diagnostics
echo "=== Pod Status ==="
kubectl get pods -n ${NAMESPACE} -o wide || true
echo "=== Events ==="
kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp' | tail -20 || true
echo "=== Helm Status ==="
helm status dynamo-platform -n ${NAMESPACE} || true
echo "ERROR_MESSAGE=$ERROR_MSG" >> $GITHUB_ENV
exit 1
fi

⚠️ Potential issue | 🟠 Major

Variable naming inconsistency and potential unsafe output concatenation.

Line 454 uses ERROR_MSG instead of ERROR_MESSAGE (used elsewhere). More critically, diagnostic outputs (pods, events) are concatenated into ERROR_MESSAGE without sanitization before being stored in GITHUB_ENV. If any output contains newlines or special characters, this could corrupt the environment variable.

Standardize the variable name and escape diagnostic output:

         if ! helm upgrade --install dynamo-platform . --namespace ${NAMESPACE} \
           --set dynamo-operator.namespaceRestriction.enabled=true \
           --set dynamo-operator.namespaceRestriction.allowedNamespaces[0]=${NAMESPACE} \
           --set dynamo-operator.controllerManager.manager.image.repository=${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo \
           --set dynamo-operator.controllerManager.manager.image.tag=${{ github.sha }}-operator-amd64 \
           --set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret 2>&1; then
-          ERROR_MSG="Failed to install dynamo-platform Helm chart. This may be due to: pre-install hook timeout, image pull failures, or resource constraints."
+          ERROR_MESSAGE="Failed to install dynamo-platform Helm chart. This may be due to: pre-install hook timeout, image pull failures, or resource constraints."
           echo "$ERROR_MSG"

For the diagnostic capture, consider logging to a file artifact instead of concatenating into a single environment variable.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

echo "Installing dynamo-platform Helm chart..."
if ! helm upgrade --install dynamo-platform . --namespace ${NAMESPACE} \
  --set dynamo-operator.namespaceRestriction.enabled=true \
  --set dynamo-operator.namespaceRestriction.allowedNamespaces[0]=${NAMESPACE} \
  --set dynamo-operator.controllerManager.manager.image.repository=${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo \
  --set dynamo-operator.controllerManager.manager.image.tag=${{ github.sha }}-operator-amd64 \
  --set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret 2>&1; then
  ERROR_MSG="Failed to install dynamo-platform Helm chart. This may be due to: pre-install hook timeout, image pull failures, or resource constraints."
  echo "$ERROR_MSG"
  # Capture additional diagnostics
  echo "=== Pod Status ==="
  kubectl get pods -n ${NAMESPACE} -o wide || true
  echo "=== Events ==="
  kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp' | tail -20 || true
  echo "=== Helm Status ==="
  helm status dynamo-platform -n ${NAMESPACE} || true
  echo "ERROR_MESSAGE=$ERROR_MSG" >> $GITHUB_ENV
  exit 1
fi

After:

echo "Installing dynamo-platform Helm chart..."
if ! helm upgrade --install dynamo-platform . --namespace ${NAMESPACE} \
  --set dynamo-operator.namespaceRestriction.enabled=true \
  --set dynamo-operator.namespaceRestriction.allowedNamespaces[0]=${NAMESPACE} \
  --set dynamo-operator.controllerManager.manager.image.repository=${{ secrets.AZURE_ACR_HOSTNAME }}/ai-dynamo/dynamo \
  --set dynamo-operator.controllerManager.manager.image.tag=${{ github.sha }}-operator-amd64 \
  --set dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret 2>&1; then
  ERROR_MESSAGE="Failed to install dynamo-platform Helm chart. This may be due to: pre-install hook timeout, image pull failures, or resource constraints."
  echo "$ERROR_MESSAGE"
  # Capture additional diagnostics
  echo "=== Pod Status ==="
  kubectl get pods -n ${NAMESPACE} -o wide || true
  echo "=== Events ==="
  kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp' | tail -20 || true
  echo "=== Helm Status ==="
  helm status dynamo-platform -n ${NAMESPACE} || true
  echo "ERROR_MESSAGE=$ERROR_MESSAGE" >> $GITHUB_ENV
  exit 1
fi
🤖 Prompt for AI Agents
In .github/workflows/container-validation-backends.yml around lines 447 to 467,
replace the inconsistent ERROR_MSG with ERROR_MESSAGE and stop writing raw
multi-line diagnostic output into GITHUB_ENV; instead capture the
pod/events/helm outputs to separate files (or a single diagnostics file) and
upload them as workflow artifacts (or base64-encode/sanitize before exporting)
while only writing a single-line, safe ERROR_MESSAGE to GITHUB_ENV (e.g.,
"Failed to install dynamo-platform; diagnostics saved to <path>" or the
base64-encoded diagnostics string) to avoid corrupting the environment file.

Comment on lines +501 to +515
POD_STATUS=$(kubectl get pods -n $NAMESPACE -o json 2>/dev/null || echo "{}")
POD_ERRORS=$(echo "$POD_STATUS" | jq -r '.items[] | select(.status.phase != "Running") | "Pod: \(.metadata.name), Status: \(.status.phase), Reason: \(.status.containerStatuses[0].state.waiting.reason // .status.reason // "N/A"), Message: \(.status.containerStatuses[0].state.waiting.message // "N/A")"' 2>/dev/null | head -10 || echo "")
if [ -n "$POD_ERRORS" ]; then
  ERROR_MESSAGE="$ERROR_MESSAGE\n\nPod Status:\n$POD_ERRORS"
fi
# Get recent events
EVENTS=$(kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' -o json 2>/dev/null || echo "{}")
ERROR_EVENTS=$(echo "$EVENTS" | jq -r '.items[] | select(.type == "Warning" or .type == "Error") | "\(.lastTimestamp) - \(.reason): \(.message)"' 2>/dev/null | tail -5 || echo "")
if [ -n "$ERROR_EVENTS" ]; then
  ERROR_MESSAGE="$ERROR_MESSAGE\n\nRecent Events:\n$ERROR_EVENTS"
fi
fi

⚠️ Potential issue | 🟠 Major

Diagnostic outputs appended to ERROR_MESSAGE without escaping.

Pod and event information are concatenated into ERROR_MESSAGE (lines 505, 513) without escaping newlines or special characters. This exacerbates the JSON escaping issue flagged earlier. When ERROR_MESSAGE is later embedded in the GitHub API payload, these unescaped values will corrupt the JSON.

Consider building a structured diagnostic object separately or properly escaping before concatenation.
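
A sketch of the structured route, assuming downstream steps can read a file (the diagnostics.json name is illustrative): capture each diagnostic once and let jq do all the escaping.

PODS=$(kubectl get pods -n "$NAMESPACE" -o wide 2>&1 || true)
EVENTS=$(kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' 2>&1 | tail -5 || true)
# jq escapes quotes and newlines in each captured string exactly once
jq -n --arg error "$ERROR_MESSAGE" --arg pods "$PODS" --arg events "$EVENTS" \
  '{error: $error, pod_status: $pods, recent_events: $events}' > diagnostics.json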

Comment on lines +517 to +540
# Create a check run with the annotation
CHECK_RUN_ID=$(curl -s -X POST \
  -H "Authorization: token $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github.v3+json" \
  "https://api.github.com/repos/${{ github.repository }}/check-runs" \
  -d '{
    "name": "Deploy Operator",
    "head_sha": "${{ github.sha }}",
    "status": "completed",
    "conclusion": "failure",
    "output": {
      "title": "Operator Deployment Failed",
      "summary": "Failed to deploy dynamo-platform operator to namespace '"${NAMESPACE}"'",
      "text": "**Job**: deploy-operator\n**Namespace**: '"${NAMESPACE}"'\n\n**Error Details**:\n```\n'"${ERROR_MESSAGE}"'\n```\n\n[View Job Run](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})",
      "annotations": [{
        "path": ".github/workflows/container-validation-backends.yml",
        "start_line": 357,
        "end_line": 425,
        "annotation_level": "failure",
        "message": "'"${ERROR_MESSAGE}"'",
        "title": "Operator deployment failed"
      }]
    }
  }' | jq -r '.id')

⚠️ Potential issue | 🔴 Critical

Critical: Unescaped ERROR_MESSAGE in JSON payload; hard-coded line numbers will become stale.

Line 536 interpolates ERROR_MESSAGE directly into JSON without escaping quotes or newlines. If the error message contains " or newline characters, the JSON will be malformed. Additionally, hard-coded line numbers (357, 425 at lines 533-534) are brittle; they must be updated whenever the file structure changes.

Apply proper JSON escaping. (Note that no steps-context expression exposes a step's source line numbers, so the hard-coded values must be kept in sync manually or computed by a helper script.)

          "annotations": [{
            "path": ".github/workflows/container-validation-backends.yml",
            "start_line": 357,
            "end_line": 425,
            "annotation_level": "failure",
-           "message": "'"${ERROR_MESSAGE}"'",
+           "message": '"$(printf '%s' "${ERROR_MESSAGE}" | jq -Rs .)"',
            "title": "Operator deployment failed"
          }]

Better yet, use a shell utility to properly escape JSON:

ERROR_MESSAGE_JSON=$(printf '%s\n' "${ERROR_MESSAGE}" | jq -Rs .)
# Then embed ERROR_MESSAGE_JSON in the payload
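
Taking that one step further, the entire payload can come out of jq so no shell interpolation ever touches the JSON (a sketch reusing the step's existing variables; the line numbers remain hard-coded, per the caveat above):

PAYLOAD=$(jq -n --arg sha "${{ github.sha }}" --arg ns "$NAMESPACE" --arg msg "$ERROR_MESSAGE" '
  {name: "Deploy Operator", head_sha: $sha, status: "completed", conclusion: "failure",
   output: {title: "Operator Deployment Failed",
            summary: ("Failed to deploy dynamo-platform operator to namespace " + $ns),
            annotations: [{path: ".github/workflows/container-validation-backends.yml",
                           start_line: 357, end_line: 425,
                           annotation_level: "failure", message: $msg,
                           title: "Operator deployment failed"}]}}')
CHECK_RUN_ID=$(curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github.v3+json" \
  -d "$PAYLOAD" "https://api.github.com/repos/${{ github.repository }}/check-runs" | jq -r '.id')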

Comment on lines +721 to +730
cat > annotation.json <<EOF
{
  "title": "Deployment Test Failed: ${{ env.FRAMEWORK }} (${{ matrix.profile }})",
  "summary": "Deployment test failed for ${{ env.FRAMEWORK }} with profile ${{ matrix.profile }}",
  "text": "**Job**: ${{ github.job }}\n**Framework**: ${{ env.FRAMEWORK }}\n**Profile**: ${{ matrix.profile }}\n**Namespace**: ${{ needs.deploy-operator.outputs.NAMESPACE }}\n\n**Error Details**:\n\`\`\`\n${ERROR_MESSAGE}\n\`\`\`\n\n[View Job Run](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})",
  "annotation_level": "failure",
  "file": ".github/workflows/container-validation-backends.yml",
  "start_line": 426,
  "end_line": 618
}

⚠️ Potential issue | 🟠 Major

Unused annotation.json file; hard-coded line numbers.

Lines 721-730 construct an annotation JSON file, but the subsequent API call (lines 734-756) uses an inline JSON payload instead. The annotation.json file is never used. Remove the dead code. Additionally, line numbers 426, 618 at lines 728-729 and 749-750 are hard-coded and will become stale.

Either:

  1. Use the file if it was intended, piping annotation.json through jq into the API call (see the sketch below), or
  2. Remove the construction entirely and rely on the inline payload.

Also address the hard-coded line numbers as described in the deploy-operator annotation comment above.
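
If the file route is kept, a sketch of actually consuming annotation.json (the check-run name is illustrative; the field names mirror the file built above):

# Wrap the annotation object from annotation.json into a check-runs payload
jq -n --slurpfile a annotation.json --arg sha "${{ github.sha }}" '
  {name: "Deployment Test", head_sha: $sha, status: "completed", conclusion: "failure",
   output: {title: $a[0].title, summary: $a[0].summary, text: $a[0].text,
            annotations: [{path: $a[0].file, start_line: $a[0].start_line,
                           end_line: $a[0].end_line,
                           annotation_level: $a[0].annotation_level,
                           message: $a[0].summary}]}}' \
| curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" \
    -H "Accept: application/vnd.github.v3+json" \
    -d @- "https://api.github.com/repos/${{ github.repository }}/check-runs"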

🤖 Prompt for AI Agents
.github/workflows/container-validation-backends.yml lines 721-730: the script
builds an unused annotation.json and embeds hard-coded start_line/end_line
values; remove the dead cat > annotation.json block entirely (or if you intended
to use a file, replace the inline payload usage with piping annotation.json
through jq to the API), and eliminate or compute the hard-coded
"start_line"/"end_line" values (either remove those fields or derive them
dynamically from the job/context rather than using static numbers) so the
workflow no longer produces unused artifacts or stale line-number metadata.
