@nogueiraanderson nogueiraanderson commented Oct 15, 2025

Summary

Migrates AWS resource cleanup from an inline CloudFormation template to CDK to work around the 51KB template size limit. Expands the scope from EC2-only cleanup to comprehensive cleanup (EC2, EKS, OpenShift).

Changes

Combined Cleanup Policies (evaluated in priority order):

  1. TTL Expiration

    • Enforces creation-time + delete-cluster-after-hours tag policies
    • Terminates instances when TTL expires
    • Automatically detects cluster type (EKS vs OpenShift) and triggers appropriate cleanup
    • Common for short-lived test clusters (8h, 24h policies)
  2. Stop Policy

    • Honors stop-after-days tag for PMM staging instances
    • Stops (not terminates) instances after specified days to reduce costs
    • Only applies to running instances
  3. Long-Stopped Detection

    • Terminates instances that have been in the stopped state for more than 30 days
    • Recovers EBS storage costs while preserving recent stopped instances
  4. Untagged Instance Cleanup

    • Validates iit-billing-tag presence and format
    • Supports category strings (pmm-staging, CirrusCI) or Unix timestamp expiration
    • Configurable grace period (default: 30 minutes) for newly launched instances
    • Prevents cost leakage from untracked resources
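
The priority ordering above can be sketched as a single decision function. This is a minimal illustration, not the code in this PR: the `Instance` dataclass and `decide_action` are hypothetical names, `creation-time` is assumed to hold Unix seconds, and `stop-after-days` is assumed to count from launch time.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Instance:
    """Hypothetical, simplified view of an EC2 instance."""

    state: str  # "running" or "stopped"
    launch_time: datetime
    tags: dict[str, str] = field(default_factory=dict)
    stopped_since: datetime | None = None


def decide_action(inst: Instance, now: datetime, grace_minutes: int = 30) -> str | None:
    """Return the first matching action in priority order, or None."""
    tags = inst.tags

    # 1. TTL expiration (assumption: creation-time is a Unix timestamp in seconds).
    if "creation-time" in tags and "delete-cluster-after-hours" in tags:
        created = datetime.fromtimestamp(int(tags["creation-time"]), tz=timezone.utc)
        ttl = timedelta(hours=float(tags["delete-cluster-after-hours"]))
        if now >= created + ttl:
            return "TERMINATE"

    # 2. Stop policy: only running instances are stopped.
    if inst.state == "running" and "stop-after-days" in tags:
        if now >= inst.launch_time + timedelta(days=float(tags["stop-after-days"])):
            return "STOP"

    # 3. Long-stopped detection: stopped for more than 30 days.
    if inst.state == "stopped" and inst.stopped_since is not None:
        if now - inst.stopped_since > timedelta(days=30):
            return "TERMINATE"

    # 4. Untagged cleanup: no billing tag once the grace period has passed.
    if "iit-billing-tag" not in tags:
        if now - inst.launch_time > timedelta(minutes=grace_minutes):
            return "TERMINATE"

    return None
```

Returning on the first match keeps the evaluation order explicit, so a TTL-expired instance is terminated even if it also carries a stop-after-days tag.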

Protected Resources:

  • Persistent billing tags: jenkins-*, pmm-dev
  • Valid future timestamps on billing tags
  • Instances matching EKS skip patterns (pe-.*)
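
A rough sketch of how the three protections listed above might be checked. The function name `protection_reason`, the reason strings, and the choice to match the skip pattern against a cluster name from the start of the string are all assumptions for illustration; the actual handler may differ.

```python
import re
import time
from fnmatch import fnmatchcase
from typing import Optional

# Patterns taken from the "Protected Resources" list; the matching rules are assumed.
PERSISTENT_TAG_PATTERNS = ("jenkins-*", "pmm-dev")
EKS_SKIP_PATTERN = re.compile(r"pe-.*")


def protection_reason(
    billing_tag: Optional[str],
    cluster_name: Optional[str] = None,
    now: Optional[float] = None,
) -> Optional[str]:
    """Return why a resource is protected, or None if no protection applies."""
    now = time.time() if now is None else now

    # EKS skip pattern (assumption: matched against the cluster name).
    if cluster_name and EKS_SKIP_PATTERN.match(cluster_name):
        return "eks-skip-pattern"

    if not billing_tag:
        return None

    # Persistent billing tags never expire.
    if any(fnmatchcase(billing_tag, p) for p in PERSISTENT_TAG_PATTERNS):
        return "persistent-billing-tag"

    # A billing tag holding a Unix timestamp protects the resource until that time.
    try:
        expires_at = int(billing_tag)
    except ValueError:
        return None
    return "unexpired-timestamp" if expires_at > now else None
```

Category strings such as pmm-staging return None here because the sketch covers only the protections in this list, not full billing tag validation.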

Cluster Cleanup:

  • EKS: Deletes CloudFormation stacks (eksctl-*-cluster), handles DELETE_FAILED retries
  • OpenShift: Full infrastructure destruction in dependency order (ELB → NAT → ENI → Security Groups → Subnets → VPC), reconciliation loop retries stubborn resources
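
The OpenShift teardown order and reconciliation loop could look roughly like the sketch below. It is deliberately free of AWS calls: `destroy_in_order` and the `deleters` mapping are hypothetical, and treating the retry count as the total number of passes is an assumption.

```python
from typing import Callable, Dict, List

# Dependency order from the description: ELB -> NAT -> ENI -> Security Groups -> Subnets -> VPC.
DELETION_ORDER = ["elb", "nat", "eni", "security_groups", "subnets", "vpc"]


def destroy_in_order(
    deleters: Dict[str, Callable[[], bool]], max_retries: int = 3
) -> List[str]:
    """Run each deleter in dependency order; return the steps still pending.

    Each deleter returns True once its resources are gone. A step that fails
    or raises is kept and retried on the next pass, still in dependency order.
    """
    pending = [step for step in DELETION_ORDER if step in deleters]
    for _ in range(max_retries):
        remaining = []
        for step in pending:
            try:
                done = deleters[step]()
            except Exception:
                # Treat any error (for example a dependency violation) as "retry later".
                done = False
            if not done:
                remaining.append(step)
        pending = remaining
        if not pending:
            break
    return pending
```

An empty return value means everything was removed; a non-empty list names the stubborn resources that the next scheduled run would need to pick up.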

Developer Tools:

  • Justfile: just deploy, just logs, just update-code, just lint
  • uv for dependency management (no venv activation)
  • Modern Python type annotations with mypy

Deployment

cd IaC/cdk/aws-resources-cleanup
just install && just bootstrap && just deploy

Already deployed in us-east-2 as AWSResourcesCleanupStack (DRY_RUN mode).

Replaces: IaC/LambdaEC2Cleanup.yml, cloud/aws-functions/orphaned_*.py, manual OpenShift workflows

@nogueiraanderson nogueiraanderson requested a review from a team as a code owner October 15, 2025 07:04

Comprehensive AWS resource cleanup Lambda with CDK deployment, replacing the
CloudFormation-based implementation that exceeded the 51KB template size limit.

Architecture changes:
- Migrate from inline CloudFormation to AWS CDK + Python
- Rename ec2_cleanup to aws_resource_cleanup (accurate scope)
- Modular code structure: 29 files, 2,338 additions
- Python 3.12 runtime with modern type annotations

Features:
- EC2 cleanup: TTL expiration, stop policies, long-stopped detection, untagged instances
- EKS cleanup: CloudFormation stack deletion with skip patterns
- OpenShift cleanup: Full cluster destruction with reconciliation loop
- Billing tag validation: Category strings and Unix timestamps
- DRY_RUN mode: Default safe preview mode
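
As an illustration of the DRY_RUN behaviour, the sketch below gates the two EC2 actions behind a flag that defaults to preview mode. `execute_action` and its return strings are hypothetical; `terminate_instances` and `stop_instances` are the standard boto3 EC2 client calls, and any object exposing them can be passed in.

```python
def execute_action(ec2, action: str, instance_id: str, dry_run: bool = True) -> str:
    """Apply TERMINATE or STOP to one instance, or only report it in dry-run mode."""
    if action not in ("TERMINATE", "STOP"):
        raise ValueError(f"unsupported action: {action}")

    # Safe default: describe what would happen without calling AWS.
    if dry_run:
        return f"[DRY_RUN] would {action} {instance_id}"

    if action == "TERMINATE":
        ec2.terminate_instances(InstanceIds=[instance_id])
    else:
        ec2.stop_instances(InstanceIds=[instance_id])
    return f"{action} {instance_id}"
```

Because the client is injected, tests can pass a recording stub and assert that no API call is made while DRY_RUN is on.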

Infrastructure:
- CDK stack with 7 configurable parameters
- Lambda: 1024MB memory, 600s timeout, hourly EventBridge schedule
- Comprehensive IAM permissions for EC2, EKS, ELB, Route53, S3, VPC
- SNS notifications for cleanup actions

Developer experience:
- Justfile automation: 26+ commands for deployment, monitoring, maintenance
- uv package manager: No manual venv activation required
- Linters: ruff, black, mypy with full type coverage
- Root IaC/justfile for multi-project routing

Configuration via environment variables:
- DRY_RUN (default: true)
- UNTAGGED_THRESHOLD_MINUTES (default: 30)
- EKS_SKIP_PATTERN (default: pe-.*)
- OPENSHIFT_CLEANUP_ENABLED (default: true)
- OPENSHIFT_BASE_DOMAIN (default: cd.percona.com)
- OPENSHIFT_MAX_RETRIES (default: 3)
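
One possible way to read these variables with the defaults listed above. The `CleanupConfig` dataclass, `load_config`, and the accepted boolean spellings are assumptions for illustration, not the module layout in this PR.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CleanupConfig:
    """Hypothetical container for the Lambda's environment configuration."""

    dry_run: bool
    untagged_threshold_minutes: int
    eks_skip_pattern: str
    openshift_cleanup_enabled: bool
    openshift_base_domain: str
    openshift_max_retries: int


def _as_bool(value: str) -> bool:
    # Assumption: "1", "true" and "yes" (any case) count as true.
    return value.strip().lower() in ("1", "true", "yes")


def load_config(env: Mapping[str, str] = os.environ) -> CleanupConfig:
    """Build the config from environment variables, falling back to the documented defaults."""
    return CleanupConfig(
        dry_run=_as_bool(env.get("DRY_RUN", "true")),
        untagged_threshold_minutes=int(env.get("UNTAGGED_THRESHOLD_MINUTES", "30")),
        eks_skip_pattern=env.get("EKS_SKIP_PATTERN", "pe-.*"),
        openshift_cleanup_enabled=_as_bool(env.get("OPENSHIFT_CLEANUP_ENABLED", "true")),
        openshift_base_domain=env.get("OPENSHIFT_BASE_DOMAIN", "cd.percona.com"),
        openshift_max_retries=int(env.get("OPENSHIFT_MAX_RETRIES", "3")),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in unit tests.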

Replaces:
- IaC/LambdaEC2Cleanup.yml (10-min simple cleanup)
- cloud/aws-functions/orphaned_*.py scripts
- Manual OpenShift destruction workflows
@nogueiraanderson nogueiraanderson force-pushed the feature/cdk-enhanced-ec2-cleanup branch from bc12d8c to d5f7df3 October 15, 2025 07:12

Add comprehensive tests for critical execution paths:
- Action execution for all types (TERMINATE, STOP, TERMINATE_CLUSTER, TERMINATE_OPENSHIFT_CLUSTER)
- Region cleanup orchestration
- CirrusCI auto-tagging
- SNS notification handling
- Error handling and edge cases

Coverage improvements:
- ec2/instances.py: 27% -> 98%
- handler.py: 18% -> 72%
- Overall: 35% -> 49%

Test additions:
- Execute cleanup actions in DRY_RUN and live modes
- OpenShift cleanup enabled/disabled scenarios
- Invalid action handling
- SNS notification with/without topic
- ClientError exception handling

Add the missing 'volumes' pytest marker and fix a test that incorrectly
expected a delete action for a volume with a valid billing tag.
The volume protection logic correctly treated the billing tag as valid,
so the test now creates the volume without a billing tag.

All 176 tests now pass with 88% code coverage.

- Standardize logging across all components (EC2, volumes, OpenShift, EKS)
  - Consistent protection messages with resource IDs and reasons
  - Region-level scan summaries with protection breakdown
  - Volume age statistics in final summary
- Add CDK parameters for operational control
  - ScheduleRateMinutes: configurable execution frequency
  - TargetRegions: filter specific regions or scan all
  - LogLevel: DEBUG/INFO/WARNING/ERROR verbosity control
  - LogRetentionDays: CloudWatch log retention configuration
- Improve parameter descriptions with category tags and impact statements
- Make justfile dynamically resolve Lambda function name from CDK outputs
- Add LambdaFunctionName to CloudFormation outputs for alignment
- Fix type annotations for mypy compliance (handler.py protection tuples)
- Update tests to handle tuple returns from protection functions

All 176 tests pass with 87% coverage.

@nogueiraanderson nogueiraanderson marked this pull request as draft October 15, 2025 22:17

Reduce README from 323 to 95 lines for better scannability:
- Condense features to bullet points
- Show only 8 key parameters (down from 15+)
- Simplify Quick Start to essential commands
- Replace verbose logging section with brief examples
- Condense troubleshooting to 3 quick tips
- Simplify architecture to single-line diagram

Focus on quick scanning and immediate action.

Upgrade Lambda runtime and all dependencies to latest versions:

Runtime:
- Lambda runtime: Python 3.12 → Python 3.13 (latest, AL2023-based)

CDK & Infrastructure:
- aws-cdk-lib: 2.150.0 → 2.220.0
- constructs: 10.0.0 → 10.4.2
- awscli: 1.32.0 → 1.42.53

AWS SDK:
- boto3: 1.34.0 → 1.40.53
- botocore: 1.34.0 → 1.40.53

Testing:
- pytest: 7.4.0 → 8.4.2
- pytest-cov: 4.1.0 → 7.0.0
- moto: 4.2.0 → 5.1.14

Code Quality:
- ruff: 0.1.0 → 0.14.0
- black: 23.0.0 → 25.9.0 (2025 stable style)
- mypy: 1.7.0 → 1.18.2

All tests passing (176 tests, 87% coverage).
Lambda successfully deployed and tested with Python 3.13 runtime.

- Add mock_lambda_context fixture to provide proper Lambda context mock
- Fix e2e tests that were passing None as context parameter
- Fix integration test DRY_RUN patch to use correct module location
- Update justfile to run tests with Python 3.13 matching Lambda runtime
- All 172 tests now passing with 85% coverage
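
For reference, a Lambda context mock along these lines is usually enough for handlers that only read a few context attributes. This is a sketch, not the fixture added in this commit: `make_mock_lambda_context` and the default function name are hypothetical, and the 600,000 ms default simply mirrors the 600s timeout mentioned earlier.

```python
from unittest.mock import MagicMock


def make_mock_lambda_context(
    function_name: str = "aws-resource-cleanup", remaining_ms: int = 600_000
) -> MagicMock:
    """Build a stand-in for the Lambda context object passed to the handler."""
    context = MagicMock()
    context.function_name = function_name
    context.aws_request_id = "test-request-id"
    # The real context exposes remaining time through this method.
    context.get_remaining_time_in_millis.return_value = remaining_ms
    return context
```

In a pytest suite this helper could be wrapped in a fixture named mock_lambda_context and passed to the handler instead of None.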