feat: Implement CDK-based AWS resource cleanup Lambda #3620
Draft
nogueiraanderson wants to merge 14 commits into master from feature/cdk-enhanced-ec2-cleanup
Comprehensive AWS resource cleanup Lambda with CDK deployment to replace CloudFormation-based implementation that exceeded 51KB template size limit.

Architecture changes:
- Migrate from inline CloudFormation to AWS CDK + Python
- Rename ec2_cleanup to aws_resource_cleanup (accurate scope)
- Modular code structure: 29 files, 2,338 additions
- Python 3.12 runtime with modern type annotations

Features:
- EC2 cleanup: TTL expiration, stop policies, long-stopped detection, untagged instances
- EKS cleanup: CloudFormation stack deletion with skip patterns
- OpenShift cleanup: Full cluster destruction with reconciliation loop
- Billing tag validation: Category strings and Unix timestamps
- DRY_RUN mode: Default safe preview mode

Infrastructure:
- CDK stack with 7 configurable parameters
- Lambda: 1024MB memory, 600s timeout, hourly EventBridge schedule
- Comprehensive IAM permissions for EC2, EKS, ELB, Route53, S3, VPC
- SNS notifications for cleanup actions

Developer experience:
- Justfile automation: 26+ commands for deployment, monitoring, maintenance
- uv package manager: No manual venv activation required
- Linters: ruff, black, mypy with full type coverage
- Root IaC/justfile for multi-project routing

Configuration via environment variables:
- DRY_RUN (default: true)
- UNTAGGED_THRESHOLD_MINUTES (default: 30)
- EKS_SKIP_PATTERN (default: pe-.*)
- OPENSHIFT_CLEANUP_ENABLED (default: true)
- OPENSHIFT_BASE_DOMAIN (default: cd.percona.com)
- OPENSHIFT_MAX_RETRIES (default: 3)

Replaces:
- IaC/LambdaEC2Cleanup.yml (10-min simple cleanup)
- cloud/aws-functions/orphaned_*.py scripts
- Manual OpenShift destruction workflows
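For readers who want to see how the environment variables in this commit message could be consumed, here is a minimal sketch. The variable names and defaults are taken from the commit message; the `CleanupConfig` dataclass, `load_config`, and `is_skipped_eks_cluster` are hypothetical names for illustration, and matching the skip pattern with `re.match` (anchored at the start of the cluster name) is an assumption rather than the PR's confirmed behavior.

```python
import os
import re
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(raw: Optional[str], default: bool) -> bool:
    """Interpret common truthy strings; fall back to the default when unset."""
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes"}


@dataclass(frozen=True)
class CleanupConfig:
    """Hypothetical container for the settings named in the commit message."""
    dry_run: bool
    untagged_threshold_minutes: int
    eks_skip_pattern: str
    openshift_cleanup_enabled: bool
    openshift_base_domain: str
    openshift_max_retries: int


def load_config(env: Optional[Mapping[str, str]] = None) -> CleanupConfig:
    """Read settings from a mapping (os.environ by default) with the documented defaults."""
    env = os.environ if env is None else env
    return CleanupConfig(
        dry_run=_as_bool(env.get("DRY_RUN"), True),
        untagged_threshold_minutes=int(env.get("UNTAGGED_THRESHOLD_MINUTES", "30")),
        eks_skip_pattern=env.get("EKS_SKIP_PATTERN", "pe-.*"),
        openshift_cleanup_enabled=_as_bool(env.get("OPENSHIFT_CLEANUP_ENABLED"), True),
        openshift_base_domain=env.get("OPENSHIFT_BASE_DOMAIN", "cd.percona.com"),
        openshift_max_retries=int(env.get("OPENSHIFT_MAX_RETRIES", "3")),
    )


def is_skipped_eks_cluster(name: str, config: CleanupConfig) -> bool:
    """Assumption: the skip pattern is matched from the start of the cluster name."""
    return re.match(config.eks_skip_pattern, name) is not None
```

Passing an explicit mapping, for example `load_config({"DRY_RUN": "false"})`, keeps the sketch easy to test without touching the process environment. Defaulting `DRY_RUN` to true means an unset variable never triggers live deletions.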
bc12d8c to d5f7df3
Add comprehensive tests for critical execution paths:
- Action execution for all types (TERMINATE, STOP, TERMINATE_CLUSTER, TERMINATE_OPENSHIFT_CLUSTER)
- Region cleanup orchestration
- CirrusCI auto-tagging
- SNS notification handling
- Error handling and edge cases

Coverage improvements:
- ec2/instances.py: 27% -> 98%
- handler.py: 18% -> 72%
- Overall: 35% -> 49%

Test additions:
- Execute cleanup actions in DRY_RUN and live modes
- OpenShift cleanup enabled/disabled scenarios
- Invalid action handling
- SNS notification with/without topic
- ClientError exception handling
Add the missing 'volumes' pytest marker and fix a test that incorrectly expected a delete action for a volume with a valid billing tag. The volume protection logic correctly treated the billing tag as valid, so the test now removes the billing tag before expecting deletion.

All 176 tests now pass with 88% code coverage.
- Standardize logging across all components (EC2, volumes, OpenShift, EKS)
  - Consistent protection messages with resource IDs and reasons
  - Region-level scan summaries with protection breakdown
  - Volume age statistics in final summary
- Add CDK parameters for operational control
  - ScheduleRateMinutes: configurable execution frequency
  - TargetRegions: filter specific regions or scan all
  - LogLevel: DEBUG/INFO/WARNING/ERROR verbosity control
  - LogRetentionDays: CloudWatch log retention configuration
- Improve parameter descriptions with category tags and impact statements
- Make justfile dynamically resolve Lambda function name from CDK outputs
- Add LambdaFunctionName to CloudFormation outputs for alignment
- Fix type annotations for mypy compliance (handler.py protection tuples)
- Update tests to handle tuple returns from protection functions

All 176 tests pass with 87% coverage.
Reduce README from 323 to 95 lines for better scannability:
- Condense features to bullet points
- Show only 8 key parameters (down from 15+)
- Simplify Quick Start to essential commands
- Replace verbose logging section with brief examples
- Condense troubleshooting to 3 quick tips
- Simplify architecture to single-line diagram

Focus on quick scanning and immediate action.
Upgrade Lambda runtime and all dependencies to latest versions:

Runtime:
- Lambda runtime: Python 3.12 → Python 3.13 (latest, AL2023-based)

CDK & Infrastructure:
- aws-cdk-lib: 2.150.0 → 2.220.0
- constructs: 10.0.0 → 10.4.2
- awscli: 1.32.0 → 1.42.53

AWS SDK:
- boto3: 1.34.0 → 1.40.53
- botocore: 1.34.0 → 1.40.53

Testing:
- pytest: 7.4.0 → 8.4.2
- pytest-cov: 4.1.0 → 7.0.0
- moto: 5.1.14 (was 4.2.0)

Code Quality:
- ruff: 0.1.0 → 0.14.0
- black: 23.0.0 → 25.9.0 (2025 stable style)
- mypy: 1.7.0 → 1.18.2

All tests passing (176 tests, 87% coverage). Lambda successfully deployed and tested with Python 3.13 runtime.
- Add mock_lambda_context fixture to provide proper Lambda context mock
- Fix e2e tests that were passing None as context parameter
- Fix integration test DRY_RUN patch to use correct module location
- Update justfile to run tests with Python 3.13 matching Lambda runtime
- All 172 tests now passing with 85% coverage
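As an illustration of the first two bullets above, a mock Lambda context might look roughly like the sketch below. The attribute names (`function_name`, `memory_limit_in_mb`, `aws_request_id`, `invoked_function_arn`, `get_remaining_time_in_millis`) are those of the standard Python Lambda context object. The factory name `make_mock_lambda_context`, the function name `aws-resource-cleanup`, and the account ID are placeholders, not values from the PR. The 1024 MB memory and 600 s timeout come from the first commit message.

```python
from unittest.mock import MagicMock


def make_mock_lambda_context(
    function_name: str = "aws-resource-cleanup",  # hypothetical function name
    remaining_ms: int = 600_000,  # mirrors the 600s timeout from the first commit
) -> MagicMock:
    """Build a stand-in for the Lambda context so handlers are not passed None."""
    context = MagicMock()
    context.function_name = function_name
    context.memory_limit_in_mb = 1024
    context.aws_request_id = "test-request-id"
    # Placeholder account ID; the region matches the deployment noted in the PR description.
    context.invoked_function_arn = (
        f"arn:aws:lambda:us-east-2:123456789012:function:{function_name}"
    )
    context.get_remaining_time_in_millis.return_value = remaining_ms
    return context
```

In a pytest suite this factory could be wrapped in a fixture that returns `make_mock_lambda_context()`, so tests receive a context with realistic attributes instead of `None`.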
Summary
Migrates AWS resource cleanup from inline CloudFormation to CDK to get around the 51KB template size limit. Expands scope from EC2-only to comprehensive cleanup (EC2, EKS, OpenShift).

Changes
Combined Cleanup Policies (evaluated in priority order):
1. TTL Expiration: creation-time + delete-cluster-after-hours tag policies
2. Stop Policy: stop-after-days tag for PMM staging instances
3. Long-Stopped Detection
4. Untagged Instance Cleanup: iit-billing-tag presence and format, accepting category strings (pmm-staging, CirrusCI) or Unix timestamp expiration

Protected Resources:
- jenkins-*, pmm-dev
- EKS skip pattern (pe-.*)

Cluster Cleanup:
- EKS CloudFormation stacks (eksctl-*-cluster), handles DELETE_FAILED retries

Developer Tools:
- just deploy, just logs, just update-code, just lint
- uv for dependency management (no venv activation)

Deployment
Already deployed in us-east-2 as AWSResourcesCleanupStack (DRY_RUN mode).

Replaces: IaC/LambdaEC2Cleanup.yml, cloud/aws-functions/orphaned_*.py, manual OpenShift workflows
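
The billing tag check described under Changes (an iit-billing-tag value that is either a category string or a Unix timestamp expiration) could be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `classify_billing_tag` and its return labels are hypothetical, and `KNOWN_CATEGORIES` contains only the two categories named in the description, while the real list may be longer.

```python
import time
from typing import Optional

# Assumption: only the two categories named in the PR description are listed here.
KNOWN_CATEGORIES = frozenset({"pmm-staging", "CirrusCI"})


def classify_billing_tag(value: Optional[str], now: Optional[float] = None) -> str:
    """Return 'missing', 'category', 'active', 'expired', or 'invalid' (hypothetical labels)."""
    if value is None or not value.strip():
        return "missing"
    value = value.strip()
    if value in KNOWN_CATEGORIES:
        return "category"
    # Treat an all-ASCII-digit value as a Unix timestamp marking expiration.
    if value.isascii() and value.isdigit():
        current = time.time() if now is None else now
        return "active" if int(value) > current else "expired"
    return "invalid"
```

For example, `classify_billing_tag("pmm-staging")` is treated as a category, while a numeric value is compared against the current time. Passing `now` explicitly keeps the timestamp branch deterministic in tests.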