Commit ddf5940

guide on using the new control-tower-multi-account-factory-async module (#2701)
* guide on using the new control-tower-multi-account-factory-async module
* add to sidebar nav
1 parent 1797f32 commit ddf5940

2 files changed: +144 -0 lines changed

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
# Automatically Remediate AWS Control Tower Drift with the Async Multi-Account Factory Module

## Background: The Problem of Drift in AWS Control Tower

When you manage AWS accounts through Control Tower and Service Catalog, OpenTofu/Terraform may detect drift in your infrastructure state. This is particularly common when:

- A new version of the Account Factory Provisioning Artifact is published
- You move an account between Organizational Units (OUs)
- Manual changes are made in the AWS Console or via API

In all of these cases, the `provisioned_product_id` changes behind the scenes, but OpenTofu/Terraform isn’t aware of it. On your next apply, it attempts to reconcile this drift by updating every affected provisioned product, even if nothing else has changed.

This becomes a major problem at scale:

- The update process is slow, especially for large organizations
- AWS imposes a hard limit of 5 concurrent account-related operations, so updates are throttled quickly
- OpenTofu/Terraform updates can take hours to complete
- You risk timeouts, failed updates, and broken infrastructure state

## The Fix: Introducing the Async Multi-Account Factory Module

To solve this, we’ve introduced a new module: `control-tower-multi-account-factory-async`.

Instead of managing `provisioned_product_id` drift directly via OpenTofu/Terraform, this module uses an asynchronous workflow built from AWS native services:

| Component | Role |
| -- | -- |
| EventBridge Rule | Listens for Service Catalog API calls such as `UpdateProvisioningArtifact` and `UpdateProvisionedProduct` |
| Ingest Lambda | Finds outdated provisioned products and queues them for update |
| SQS FIFO Queue | Stores update jobs with strict ordering and deduplication |
| Worker Lambda | Applies the update and starts the Step Functions state machine |
| AWS Step Functions state machine | Monitors the update process and confirms success or failure |

35+
This async approach operates as follows:
36+
37+
38+
```mermaid
39+
flowchart TD
40+
41+
%% Define reusable terminator nodes
42+
X((Lambda Ends))
43+
Y((Lambda Ends))
44+
45+
%% Trigger & Event Rule
46+
A[User/API triggers UpdateProvisioningArtifact or UpdateProvisionedProduct]
47+
A --> B[EventBridge Rule]
48+
B --> C[Ingest Lambda]
49+
50+
%% Product Identification & Queue
51+
C -->|Find affected provisioned products| D[Affected Products List]
52+
D -->|Queue updates| E[SQS FIFO Queue]
53+
54+
%% Ingest exits
55+
C --> Y((Lambda Ends))
56+
57+
%% Worker & Initial Actions
58+
E -->|Trigger| F[Worker Lambda]
59+
F -->|UpdateProvisionedProduct| G[Service Catalog]
60+
F -->|StartExecution| H[AWS Step Functions state machine]
61+
62+
%% Worker exits
63+
F --> X((Lambda Ends))
64+
65+
%% AWS Step Functions handles status polling
66+
H -->|DescribeRecord loop| G
67+
G -->|Status| H
68+
H -->|Success/Failure| I[End]
69+
70+
%% Rate limiting logic
71+
F -->|Rate limit hit| R[Re-queued to FIFO Queue]
72+
R --> E
73+
74+
%% DLQ path
75+
E -->|Max retries reached| J[Dead Letter Queue]
76+
```
77+
78+
Why is this better?
79+
80+
- Drift is resolved outside OpenTofu/Terraform
81+
- Updates happen automatically, with no user action
82+
- Concurrency is controlled to avoid throttling
83+
- Your OpenTofu/Terraform applies stay fast and clean
84+
85+
## Step-by-Step: Switching to the Async Module
86+
87+
1. Update your terragrunt.hcl to use the new module
88+
89+
Replace this:
90+
91+
```hcl
92+
terraform {
93+
source = "[email protected]:gruntwork-io/terraform-aws-control-tower.git//modules/landingzone/control-tower-multi-account-factory?ref=VERSION"
94+
}
95+
```
96+
97+
With this:
98+
99+
```hcl
100+
terraform {
101+
source = "[email protected]:gruntwork-io/terraform-aws-control-tower.git//modules/landingzone/control-tower-multi-account-factory-async?ref=VERSION"
102+
}
103+
```
104+
105+
_Note: No state migration is needed — this is a drop-in replacement._
106+
107+
2. Apply your changes
108+
109+
Run `terragrunt apply` either directly or through GitHub Actions. This will deploy:
110+
111+
- The new Lambda functions
112+
- SQS FIFO queue + DLQ
113+
- EventBridge rules for Service Catalog API monitoring
114+
- AWS Step Functions state machine
115+
116+
After apply, drifted `provisioned_product_id` values will be remediated whenever [UpdateProvisioningArtifact](https://docs.aws.amazon.com/servicecatalog/latest/dg/API_UpdateProvisioningArtifact.html) or [UpdateProvisionedProduct](https://docs.aws.amazon.com/servicecatalog/latest/dg/API_UpdateProvisionedProduct.html) API calls occur.
117+
118+
119+
_Note: If your environment is already in a drifted state, you will need to trigger UpdateProvisioningArtifact or UpdateProvisionedProduct to initiate drift remediation. The simplest way to do this is to deactivate and reactivate the current provisioning artifact version._
120+
121+
## Optional: Control Concurrency with lambda_worker_max_concurrent_operations
122+
123+
AWS Service Catalog currently enforces a [hard limit of 5 account-related operations concurrently](https://docs.aws.amazon.com/controltower/latest/userguide/provision-and-manage-accounts.html#:~:text=You%20can%20perform%20up%20to%20five%20(5)%20account%2Drelated%20operations%20concurrently%2C%20including%20provisioning%2C%20updating%2C%20and%20enrolling.) that includes provisioning, updating, and enrolling. Exceeding this limit may result in throttling errors or failed updates.
124+
125+
To avoid hitting that limit (and prevent failed updates), you can configure the number of concurrent updates with the `lambda_worker_max_concurrent_operations` variable. Example:
126+
127+
```hcl
128+
inputs = {
129+
lambda_worker_max_concurrent_operations = X
130+
}
131+
```
132+
133+
This variable tells the worker Lambda to never initiate more than X updates at a time, which can be used to leave headroom for other processes (like provisioning new accounts) to succeed.
134+
135+
|Value | Behavior |
136+
| -- | -- |
137+
|`5` | Max concurrency allowed by AWS (use with caution) |
138+
|`<5` | Safe concurrency with headroom for other ops |
139+
|`1` | Serialized updates, safest but slowest |
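
As a concrete illustration of the `<5` row, the sketch below sets the limit to 3. The value is an arbitrary example rather than a recommendation from the module's authors: with the five-operation limit described above, it would leave two slots free for work such as provisioning new accounts.

```hcl
inputs = {
  # Illustrative value only: 3 concurrent updates leaves 2 of the
  # 5 allowed account-related operations available for other work.
  lambda_worker_max_concurrent_operations = 3
}
```

If new accounts are provisioned frequently while updates are in flight, a lower value trades remediation speed for more headroom.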

sidebars/docs.js

Lines changed: 5 additions & 0 deletions
@@ -490,6 +490,11 @@ const sidebar = [
         type: "doc",
         id: "2.0/docs/accountfactory/guides/iam-roles",
       },
+      {
+        label: "Automatically Remediate AWS Control Tower Drift with Async Multi-Account Factory Module",
+        type: "doc",
+        id: "2.0/docs/accountfactory/guides/drift-remediation-with-async-module",
+      },
     ],
   },
   {
