Automated CI/CD failure analysis and remediation using AI agents.
This service receives GitHub workflow failure webhooks, analyzes the failure context using AI, and proposes or applies automated fixes to resolve build issues.
The project follows a clean architecture pattern with clear separation of concerns:
src/github_action_triage/
├── app/  # Application layer
│   ├── web/  # FastAPI routers and HTTP concerns
│   │   └── api.py  # Webhook endpoints
│   ├── events/  # Domain models and events
│   │   ├── models.py  # Event DTOs (WorkflowRunFailureEvent, etc.)
│   │   └── outcomes.py  # Triage result enums
│   ├── config/  # Configuration management
│   │   └── settings.py  # Environment-based settings
│   ├── infra/  # Infrastructure adapters
│   │   ├── github_client.py  # GitHub API integration
│   │   └── github_issue_creator.py  # GitHub issue creation
│   ├── llm/  # LLM integrations
│   │   └── mcp.py  # MCP client configuration
│   ├── api.py  # Core triage service orchestration
│   ├── celery_app.py  # Celery application configuration
│   └── factory.py  # FastAPI application factory
├── agent/  # Agent layer
│   ├── analysis/  # Analysis agent implementation
│   │   ├── agent.py  # Core analysis agent (pydantic-ai)
│   │   ├── config.py  # Analysis agent configuration
│   │   ├── instructions.py  # Agent instruction builders
│   │   └── tools/  # Agent tool integrations
│   │       ├── github.py  # GitHub API tools
│   │       └── sourcegraph.py  # Sourcegraph code search tools
│   ├── ports.py  # Protocol definitions for external services
│   ├── config.py  # Agent configuration
│   └── mcp.py  # MCP tool integrations
└── tasks/  # Background task layer
    └── triage.py  # Celery tasks for async triage processing
- App Package: Webhook routing, event models, infrastructure adapters, LLM clients
- Agent Package: External service protocols, AI integrations, MCP tool configurations
- Tasks Package: Celery background tasks for asynchronous processing
- Ports Pattern: Protocol-based dependency injection for testability and flexibility
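The ports pattern listed above can be illustrated with a small, stdlib-only sketch. The `IssueCreator` name appears in this README, but the `create_issue` signature, the `InMemoryIssueCreator` test double, and the `report_failure` helper are assumptions made for illustration and may not match the real definitions in `agent/ports.py`.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple


class IssueCreator(Protocol):
    """Port: anything that can publish a triage proposal (hypothetical signature)."""

    def create_issue(self, repo: str, title: str, body: str) -> str: ...


@dataclass
class InMemoryIssueCreator:
    """Test double that records issues instead of calling GitHub."""

    created: List[Tuple[str, str, str]] = field(default_factory=list)

    def create_issue(self, repo: str, title: str, body: str) -> str:
        self.created.append((repo, title, body))
        return f"{repo}#{len(self.created)}"


def report_failure(creator: IssueCreator, repo: str, summary: str) -> str:
    """Depends only on the port, so any structurally matching adapter can be injected."""
    return creator.create_issue(repo, f"CI failure: {summary}", summary)


fake = InMemoryIssueCreator()
ref = report_failure(fake, "octo/demo", "lint step failed")
```

Because `Protocol` uses structural typing, the test double does not need to inherit from the port; a production adapter backed by the GitHub API could be swapped in without changing `report_failure`.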
- GitHub webhook → FastAPI endpoint (/github/webhook)
- Webhook handler enqueues Celery task
- Returns 200 OK immediately
- Celery worker processes task asynchronously:
  - Context gathering via GitHubContextProvider
  - Diagnosis via RemediationAgent (with MCP tools)
  - Comment posting or issue creation via IssueCreator
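The request flow above can be sketched without any framework. This is not the service's code: a plain in-process `queue.Queue` stands in for Celery, the flat payload shape is hypothetical (real GitHub payloads are nested), and `gather_context`, `diagnose`, and `publish` are placeholder functions for the three worker steps.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    repo: str
    run_id: int


# Stand-in for the Celery broker: the handler only enqueues work.
task_queue: "queue.Queue[FailureEvent]" = queue.Queue()


def handle_webhook(payload: dict) -> int:
    """Enqueue the event and return 200 right away; no analysis happens here."""
    task_queue.put(FailureEvent(repo=payload["repository"], run_id=payload["run_id"]))
    return 200


def gather_context(event: FailureEvent) -> str:
    return f"logs for run {event.run_id} in {event.repo}"


def diagnose(context: str) -> str:
    return f"diagnosis based on: {context}"


def publish(event: FailureEvent, diagnosis: str) -> str:
    return f"[{event.repo}] {diagnosis}"


def run_worker_once() -> str:
    """Worker side: context gathering, then diagnosis, then publishing."""
    event = task_queue.get_nowait()
    return publish(event, diagnose(gather_context(event)))


status = handle_webhook({"repository": "octo/demo", "run_id": 42})
result = run_worker_once()
```

Keeping the handler limited to enqueueing is what allows the endpoint to acknowledge the webhook quickly, while the slower analysis runs in the worker.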
- Python >=3.14
- uv package manager
# Clone the repository
git clone https://github.com/trly/github-action-triage.git
cd github-action-triage
# Initialize issue tracking (required for development)
bd onboard
# Install Python dependencies
uv sync --group dev
# Install the project in editable mode (required for imports)
uv pip install -e .
# Verify setup by running tests
uv run pytest
# Start development server with auto-reload
uv run poe dev
# Or manually
uv run uvicorn main:app --reload
The API will be available at:
- http://localhost:8000
- Docs: http://localhost:8000/docs
- Health check: http://localhost:8000/github/health
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=src/github_action_triage
# Run specific test file
uv run pytest tests/test_app_events.py -v
The service is configured via environment variables with the TRIAGE_ prefix:
export TRIAGE_GITHUB_APP_ID="123456"
export TRIAGE_GITHUB_PRIVATE_KEY="$(cat path/to/your-app.pem)"
export TRIAGE_GITHUB_WEBHOOK_SECRET="your-webhook-secret"
export TRIAGE_ANTHROPIC_API_KEY="sk-ant-..."
export TRIAGE_SOURCEGRAPH_TOKEN="sgp_..."
export TRIAGE_SOURCEGRAPH_MCP_URL="http://localhost:3000"
export TRIAGE_LOG_LEVEL="INFO" # DEBUG, INFO, WARNING, ERROR, CRITICAL
export TRIAGE_DISABLE_ISSUE_CREATION="false" # Set to "true" for testing without creating issues
Notes:
- TRIAGE_GITHUB_PRIVATE_KEY should contain the full PEM content (including the -----BEGIN RSA PRIVATE KEY----- and -----END RSA PRIVATE KEY----- lines), not just a file path.
- TRIAGE_DISABLE_ISSUE_CREATION, when set to "true", disables GitHub issue creation and logs the proposal instead. This is useful for local testing and development, since it avoids cluttering repositories with test issues.
- TRIAGE_GITHUB_WEBHOOK_SECRET should be a secure random string. Generate one with:
  # Generate a secure random secret
  openssl rand -hex 32
  # Or use Ruby
  ruby -rsecurerandom -e 'puts SecureRandom.hex(32)'
  # Or use Python
  python3 -c 'import secrets; print(secrets.token_hex(32))'
  Configure this same secret in your GitHub App webhook settings for signature verification.
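To show what the webhook secret is used for, here is a minimal sketch of signature verification. It relies on GitHub's documented scheme, in which the `X-Hub-Signature-256` header carries `sha256=` followed by the hex HMAC-SHA256 of the raw request body. The function name `verify_signature` is hypothetical, and the service's actual check may be structured differently.

```python
import hashlib
import hmac
from typing import Optional


def verify_signature(secret: str, body: bytes, signature_header: Optional[str]) -> bool:
    """Return True only if the header matches the HMAC-SHA256 of the raw body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = f"sha256={digest}"
    # Compare as bytes in constant time to avoid leaking timing information.
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))


# Example values; the secret and body are placeholders.
secret = "your-webhook-secret"
body = b'{"action": "completed"}'
good_header = "sha256=" + hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
```

The check must run against the raw request bytes, since re-serializing parsed JSON can change the body and invalidate the signature.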
The service is containerized using a multi-stage Docker build with a minimal Chainguard Python runtime.
# Build the container image
docker build -t github-action-triage:latest .
# Or with a specific tag
docker build -t ghcr.io/yourorg/github-action-triage:v1.0.0 .
# Run with environment variables
docker run -d \
-p 8000:8000 \
-e TRIAGE_GITHUB_APP_ID="123456" \
-e TRIAGE_GITHUB_PRIVATE_KEY="$(cat path/to/your-app.pem)" \
-e TRIAGE_GITHUB_WEBHOOK_SECRET="your-webhook-secret" \
-e TRIAGE_ANTHROPIC_API_KEY="sk-ant-..." \
-e TRIAGE_SOURCEGRAPH_TOKEN="sgp_..." \
-e TRIAGE_SOURCEGRAPH_MCP_URL="http://localhost:3000" \
-e TRIAGE_LOG_LEVEL="INFO" \
--name github-action-triage \
github-action-triage:latest
Create a .env file with your configuration:
TRIAGE_GITHUB_APP_ID=123456
TRIAGE_GITHUB_PRIVATE_KEY=-----BEGIN RSA PRIVATE KEY-----
...
-----END RSA PRIVATE KEY-----
TRIAGE_GITHUB_WEBHOOK_SECRET=your-webhook-secret
TRIAGE_ANTHROPIC_API_KEY=sk-ant-...
TRIAGE_SOURCEGRAPH_TOKEN=sgp_...
TRIAGE_SOURCEGRAPH_MCP_URL=http://localhost:3000
TRIAGE_LOG_LEVEL=INFO
Then run:
docker run -d -p 8000:8000 --env-file .env --name github-action-triage github-action-triage:latest
Create a docker-compose.yml file:
services:
  triage:
    build: .
    ports:
      - "8000:8000"
    env_file:
      - .env
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/github/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
Start the service:
docker compose up -d
- Secrets Management: Use Docker secrets or a secrets manager (AWS Secrets Manager, HashiCorp Vault) instead of environment variables for sensitive data
- Logging: Container logs are sent to stdout/stderr; configure log aggregation (Datadog, CloudWatch, etc.)
- Monitoring: Expose the /github/health endpoint for health checks and load balancer integration
- Resource Limits: Set memory and CPU limits in production:
docker run -d -p 8000:8000 --memory="512m" --cpus="1.0" --env-file .env github-action-triage:latest
- Security: The container uses the minimal Chainguard Python image for reduced attack surface
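The health endpoint mentioned in the considerations above can also be probed from Python, which may help where the runtime image ships without curl (worth checking for your image). This is a stdlib-only sketch: the function name `is_healthy` is hypothetical, and it only checks for an HTTP 200 status because the response body format is not documented here.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts, and 4xx/5xx responses
        # (urllib raises HTTPError, a URLError subclass, for those).
        return False
```

A container health check or deploy script could call `is_healthy("http://localhost:8000/github/health")` and exit non-zero when it returns `False`.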
- Write tests first to specify behavior
- Implement to satisfy tests
- Ensure all tests pass: uv run pytest
- Verify server boots: uv run poe dev