|
| 1 | +# Automatically Remediate AWS Control Tower Drift with the Async Multi-Account Factory Module |
| 2 | + |
| 3 | +## Background: The Problem of Drift in AWS Control Tower |
| 4 | + |
| 5 | +When managing AWS accounts via Control Tower and Service Catalog, you may encounter an issue where OpenTofu/Terraform detects drift in your infrastructure state. This is particularly common when: |
| 6 | + |
| 7 | +- A new version of the Account Factory Provisioning Artifact is published |
| 8 | +- You move an account between Organizational Units (OUs) |
| 9 | +- Manual changes are made in the AWS Console or via API |
| 10 | + |
| 11 | +In all of these cases, the `provisioned_product_id` changes behind the scenes, but OpenTofu/Terraform isn’t aware of it. When you next apply your infrastructure code, it attempts to reconcile this drift by updating every affected provisioned product, even if nothing else has changed. |
| 12 | + |
| 13 | +This becomes a major problem at scale: |
| 14 | + |
| 15 | +- The update process is slow, especially for large organizations |
| 16 | +- AWS allows at most 5 concurrent account-related operations (provisioning, updating, and enrolling), so bulk updates are throttled quickly
| 17 | +- OpenTofu/Terraform updates can take hours to complete |
| 18 | +- You risk timeouts, failed updates, and broken infrastructure state |
| 19 | + |
| 20 | +## The Fix: Introducing the Async Multi-Account Factory Module |
| 21 | + |
| 22 | +To solve this, we’ve introduced a new module: `control-tower-multi-account-factory-async`.
| 23 | + |
| 24 | +Instead of managing `provisioned_product_id` drift directly via OpenTofu/Terraform, this module uses an asynchronous workflow built from AWS native services: |
| 25 | + |
| 26 | +|Component | Role | |
| 27 | +| -- | -- | |
| 28 | +|EventBridge Rule | Listens for Service Catalog API calls such as `UpdateProvisioningArtifact` and `UpdateProvisionedProduct` |
| 29 | +|Ingest Lambda | Finds outdated provisioned products and queues them for update | |
| 30 | +|SQS FIFO Queue | Stores update jobs with strict ordering and deduplication | |
| 31 | +|Worker Lambda | Applies the update and starts an AWS Step Functions state machine execution |
| 32 | +|AWS Step Functions state machine | Monitors the update process and confirms success or failure | |
| 33 | + |
| 34 | + |
| 35 | +This async approach operates as follows: |
| 36 | + |
| 37 | + |
| 38 | +```mermaid |
| 39 | +flowchart TD |
| 40 | +
|
| 41 | + %% Define reusable terminator nodes |
| 42 | + X((Lambda Ends)) |
| 43 | + Y((Lambda Ends)) |
| 44 | +
|
| 45 | + %% Trigger & Event Rule |
| 46 | + A[User/API triggers UpdateProvisioningArtifact or UpdateProvisionedProduct] |
| 47 | + A --> B[EventBridge Rule] |
| 48 | + B --> C[Ingest Lambda] |
| 49 | +
|
| 50 | + %% Product Identification & Queue |
| 51 | + C -->|Find affected provisioned products| D[Affected Products List] |
| 52 | + D -->|Queue updates| E[SQS FIFO Queue] |
| 53 | +
|
| 54 | + %% Ingest exits |
| 55 | + C --> Y((Lambda Ends)) |
| 56 | +
|
| 57 | + %% Worker & Initial Actions |
| 58 | + E -->|Trigger| F[Worker Lambda] |
| 59 | + F -->|UpdateProvisionedProduct| G[Service Catalog] |
| 60 | + F -->|StartExecution| H[AWS Step Functions state machine] |
| 61 | +
|
| 62 | + %% Worker exits |
| 63 | + F --> X((Lambda Ends)) |
| 64 | +
|
| 65 | + %% AWS Step Functions handles status polling |
| 66 | + H -->|DescribeRecord loop| G |
| 67 | + G -->|Status| H |
| 68 | + H -->|Success/Failure| I[End] |
| 69 | +
|
| 70 | + %% Rate limiting logic |
| 71 | + F -->|Rate limit hit| R[Re-queued to FIFO Queue] |
| 72 | + R --> E |
| 73 | +
|
| 74 | + %% DLQ path |
| 75 | + E -->|Max retries reached| J[Dead Letter Queue] |
| 76 | +``` |
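
To make the first half of the diagram more concrete, here is a minimal sketch of how the event rule and the FIFO queue with its dead letter queue could be declared. This is not the module's actual source: the resource names, queue names, and the `maxReceiveCount` value are hypothetical, and the sketch assumes the API calls reach EventBridge as CloudTrail management events.

```hcl
# Hypothetical sketch, not the module's real implementation.
# All names and the maxReceiveCount value are illustrative assumptions.

# Match the two Service Catalog API calls, assuming they arrive
# as "AWS API Call via CloudTrail" events.
resource "aws_cloudwatch_event_rule" "service_catalog_updates" {
  name        = "example-service-catalog-updates"
  description = "Matches UpdateProvisioningArtifact and UpdateProvisionedProduct calls"

  event_pattern = jsonencode({
    source        = ["aws.servicecatalog"]
    "detail-type" = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["servicecatalog.amazonaws.com"]
      eventName   = ["UpdateProvisioningArtifact", "UpdateProvisionedProduct"]
    }
  })
}

# Dead letter queue. The DLQ of a FIFO queue must itself be a FIFO queue,
# and FIFO queue names must end in ".fifo".
resource "aws_sqs_queue" "updates_dlq" {
  name       = "example-provisioned-product-updates-dlq.fifo"
  fifo_queue = true
}

# Main FIFO queue holding update jobs.
resource "aws_sqs_queue" "updates" {
  name                        = "example-provisioned-product-updates.fifo"
  fifo_queue                  = true
  content_based_deduplication = true

  # After maxReceiveCount failed receives, a message moves to the DLQ.
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.updates_dlq.arn
    maxReceiveCount     = 5
  })
}
```

You do not need to write any of this yourself; the module deploys its own equivalents. The sketch only illustrates what "strict ordering and deduplication" and the "Max retries reached" path in the diagram correspond to in configuration.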
| 77 | + |
| 78 | +Why is this better? |
| 79 | + |
| 80 | +- Drift is resolved outside OpenTofu/Terraform |
| 81 | +- Updates happen automatically, with no user action |
| 82 | +- Concurrency is controlled to avoid throttling |
| 83 | +- Your OpenTofu/Terraform applies stay fast and clean |
| 84 | + |
| 85 | +## Step-by-Step: Switching to the Async Module |
| 86 | + |
| 87 | +1. Update your `terragrunt.hcl` to use the new module
| 88 | + |
| 89 | +Replace this: |
| 90 | + |
| 91 | +```hcl |
| 92 | +terraform { |
| 93 | + source = "git@github.com:gruntwork-io/terraform-aws-control-tower.git//modules/landingzone/control-tower-multi-account-factory?ref=VERSION"
| 94 | +} |
| 95 | +``` |
| 96 | + |
| 97 | +With this: |
| 98 | + |
| 99 | +```hcl |
| 100 | +terraform { |
| 101 | + source = "git@github.com:gruntwork-io/terraform-aws-control-tower.git//modules/landingzone/control-tower-multi-account-factory-async?ref=VERSION"
| 102 | +} |
| 103 | +``` |
| 104 | + |
| 105 | +_Note: No state migration is needed — this is a drop-in replacement._ |
| 106 | + |
| 107 | +2. Apply your changes |
| 108 | + |
| 109 | +Run `terragrunt apply` either directly or through GitHub Actions. This will deploy: |
| 110 | + |
| 111 | +- The new Lambda functions |
| 112 | +- SQS FIFO queue + DLQ |
| 113 | +- EventBridge rules for Service Catalog API monitoring |
| 114 | +- AWS Step Functions state machine |
| 115 | + |
| 116 | +After apply, drifted `provisioned_product_id` values will be remediated whenever [UpdateProvisioningArtifact](https://docs.aws.amazon.com/servicecatalog/latest/dg/API_UpdateProvisioningArtifact.html) or [UpdateProvisionedProduct](https://docs.aws.amazon.com/servicecatalog/latest/dg/API_UpdateProvisionedProduct.html) API calls occur. |
| 117 | + |
| 118 | + |
| 119 | +_Note: If your environment is already in a drifted state, you will need to trigger UpdateProvisioningArtifact or UpdateProvisionedProduct to initiate drift remediation. The simplest way to do this is to deactivate and reactivate the current provisioning artifact version._ |
| 120 | + |
| 121 | +## Optional: Control Concurrency with `lambda_worker_max_concurrent_operations`
| 122 | + |
| 123 | +AWS Control Tower currently enforces a [hard limit of 5 concurrent account-related operations](https://docs.aws.amazon.com/controltower/latest/userguide/provision-and-manage-accounts.html#:~:text=You%20can%20perform%20up%20to%20five%20(5)%20account%2Drelated%20operations%20concurrently%2C%20including%20provisioning%2C%20updating%2C%20and%20enrolling.), which covers provisioning, updating, and enrolling. Exceeding this limit may result in throttling errors or failed updates.
| 124 | + |
| 125 | +To avoid hitting that limit (and prevent failed updates), you can configure the number of concurrent updates with the `lambda_worker_max_concurrent_operations` variable. Example: |
| 126 | + |
| 127 | +```hcl |
| 128 | +inputs = { |
| 129 | + lambda_worker_max_concurrent_operations = 3 # Example value for X; choose a number from 1 to 5
| 130 | +} |
| 131 | +``` |
| 132 | + |
| 133 | +This variable tells the worker Lambda to never initiate more than X updates at a time, which can be used to leave headroom for other processes (like provisioning new accounts) to succeed. |
| 134 | + |
| 135 | +|Value | Behavior | |
| 136 | +| -- | -- | |
| 137 | +|`5` | Max concurrency allowed by AWS (use with caution) | |
| 138 | +|`<5` | Safe concurrency with headroom for other ops | |
| 139 | +|`1` | Serialized updates, safest but slowest | |
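
Putting both steps together, a complete `terragrunt.hcl` might look like the following sketch. The value `3` is only an illustrative choice: with a limit of 5 concurrent operations, it would leave room for 2 other account-related operations to run alongside drift remediation. `VERSION` remains a placeholder for the release you pin to.

```hcl
# Illustrative sketch; adjust the ref and the concurrency value to your environment.
terraform {
  source = "git@github.com:gruntwork-io/terraform-aws-control-tower.git//modules/landingzone/control-tower-multi-account-factory-async?ref=VERSION"
}

inputs = {
  # Assumed example value: at most 3 updates in flight,
  # leaving headroom of 2 under the limit of 5.
  lambda_worker_max_concurrent_operations = 3
}
```

Any other inputs you already pass to the existing module stay as they are, since the async module is described above as a drop-in replacement.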