
Commit 365ade9

Lifecycle management - data, decommission, track (#930)
2 parents 7cdea2b + 7289d07 commit 365ade9

File tree

5 files changed: +265 -2 lines changed


CODEOWNERS

Lines changed: 3 additions & 0 deletions
@@ -34,3 +34,6 @@
 # Sentinel documentation ownership
 /content/sentinel/ @hashicorp/team-docs-packer-and-terraform @hashicorp/tf-compliance
 
+# Well-architected framework
+
+/content/well-architected-framework/ @hashicorp/well-architected-education-approvers

content/well-architected-framework/data/docs-nav-data.json

Lines changed: 13 additions & 0 deletions
@@ -415,6 +415,19 @@
       "title": "Overview",
       "path": "optimize-systems"
     },
+    {
+      "title": "Lifecycle management",
+      "routes": [
+        {
+          "title": "Implement data retention policies",
+          "path": "optimize-systems/lifecycle-management/data-management"
+        },
+        {
+          "title": "Decommission resources",
+          "path": "optimize-systems/lifecycle-management/decommission-infrastructure"
+        }
+      ]
+    },
     {
       "title": "Monitor system health",
       "routes": [

content/well-architected-framework/docs/docs/define-and-automate-processes/automate/cicd.mdx

Lines changed: 3 additions & 2 deletions
@@ -31,5 +31,6 @@ In this section of Automate your workflows, you learned how to implement CI/CD p

 Visit the following documents to learn more about the automation workflow:

-- [Automate testing](/well-architected-framework/define-and-automate-processes/automate/testing) - Implement automated testing in your CI/CD pipeline
-- [Automate deployments](/well-architected-framework/define-and-automate-processes/automate/deployments) - Deploy applications through your CI/CD pipeline
+- [Automate testing](/well-architected-framework/define-and-automate-processes/automate/testing) in your CI/CD pipelines
+- [Automate application deployments](/well-architected-framework/define-and-automate-processes/automate/deployments) through your CI/CD pipeline
+- Learn how to orchestrate [Terraform runs](/terraform/tutorials/automation/automate-terraform) to ensure consistency between runs.
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
---
page_title: Implement data management policies
description: Implement data management policies to reduce storage costs, ensure compliance, and manage data lifecycles with infrastructure as code.
---

# Implement data management policies

You can use data management policies to manage the lifecycle of your organization's data. Whether you store data in the cloud or on-premises, it is important to define and automate the policies that govern that data. Defining these policies with infrastructure as code tools, such as Terraform, ensures you apply them consistently across all environments and resources.

## Why you should use lifecycle policies

Most major cloud providers offer lifecycle management features for their storage services. These features let you define rules that automatically transition data between storage classes based on age or access patterns, and delete data that has reached the end of its retention period.

When you implement data management policies, you gain the following benefits:

- Reduce storage costs by automatically deleting data that is no longer needed.
- Reduce storage costs by storing data in the most cost-effective storage class based on access patterns and retention requirements.
- Ensure compliance with legal and regulatory requirements for data retention.
- Minimize security risks by removing sensitive data after a defined period of time.

## Automate policy management with infrastructure as code

You can use Terraform to define lifecycle policies and implement them across your organization. You can create Terraform modules that define data management policies for different data types and compliance requirements. These modules can automatically apply the appropriate lifecycle rules, storage class transitions, and deletion policies to new or existing storage resources.

The following Terraform configuration defines a [data lifecycle policy](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_lifecycle_configuration#specifying-a-filter-based-on-object-size) that moves AWS S3 data to Glacier Instant Retrieval after 365 days:

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "example" {
  bucket = aws_s3_bucket.bucket.id

  rule {
    id = "Allow small object transitions"

    filter {
      object_size_greater_than = 1
    }

    status = "Enabled"

    transition {
      days          = 365
      storage_class = "GLACIER_IR"
    }
  }
}
```
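Lifecycle rules can also delete data that has reached the end of its retention period. The following sketch adds an `expiration` block to a rule; the bucket reference, rule name, prefix, and seven-year retention period are illustrative assumptions:

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "retention" {
  bucket = aws_s3_bucket.bucket.id

  rule {
    id     = "Expire logs after seven years"
    status = "Enabled"

    # Apply the rule only to objects under the logs/ prefix
    filter {
      prefix = "logs/"
    }

    # Permanently delete objects 2555 days (roughly seven years) after creation
    expiration {
      days = 2555
    }
  }
}
```

You can pair `expiration` with a `transition` block in the same rule when data must move to cheaper storage before deletion.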
Terraform can also tag resources with appropriate retention metadata. These tags can include creation dates, data classifications, and retention periods.

For example, you can use the [`tag` block](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_lifecycle_configuration#specifying-a-filter-based-on-an-object-tag) with AWS S3 to apply lifecycle rules based on object tags. The S3 lifecycle rule specifies a filter with a tag key and value, so the rule applies only to the subset of objects that carry that tag.

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "example" {
  bucket = aws_s3_bucket.bucket.id

  rule {
    id = "rule-1"

    filter {
      tag {
        key   = "Name"
        value = "Staging"
      }
    }

    transition {
      days          = 30
      storage_class = "GLACIER"
    }

    status = "Enabled"
  }
}
```

Other cloud providers, such as [Google Cloud Platform](https://registry.terraform.io/providers/hashicorp/google/5.0.0/docs/resources/storage_bucket.html#example-usage---life-cycle-settings-for-storage-bucket-objects) and [Microsoft Azure](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_management_policy), offer similar lifecycle management features for their storage services. You can use Terraform to manage lifecycle policies across multiple cloud providers, ensuring consistent data management practices regardless of where your data resides.
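As a sketch of the GCP equivalent (the bucket name and location are illustrative), the `google_storage_bucket` resource accepts `lifecycle_rule` blocks directly:

```hcl
resource "google_storage_bucket" "example" {
  name     = "example-retention-bucket"
  location = "US"

  # Move objects to Coldline storage after 90 days
  lifecycle_rule {
    condition {
      age = 90
    }
    action {
      type          = "SetStorageClass"
      storage_class = "COLDLINE"
    }
  }

  # Delete objects after 365 days
  lifecycle_rule {
    condition {
      age = 365
    }
    action {
      type = "Delete"
    }
  }
}
```

Azure expresses comparable rules through the separate `azurerm_storage_management_policy` resource.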
HashiCorp resources:

- Search the [Terraform Registry](https://registry.terraform.io/browse/providers) for the [cloud](https://registry.terraform.io/browse/providers?category=public-cloud) or [database](https://registry.terraform.io/browse/providers?category=database) provider you use.
- Learn best practices for writing Terraform with the Terraform [style guide](/terraform/language/style).

External resources:

- Cloud storage: [AWS](https://aws.amazon.com/products/storage/), [GCP](https://cloud.google.com/products/storage), and [Azure](https://azure.microsoft.com/en-us/products/category/storage)
- Learn how to [set the lifecycle configuration for a Google Cloud bucket](https://cloud.google.com/storage/docs/samples/storage-create-lifecycle-setting-tf) with Terraform.
- AWS: [Enforce data retention policies](https://docs.aws.amazon.com/wellarchitected/latest/framework/cost_decomissioning_resources_data_retention.html)

## Next steps

In this section of Lifecycle management, you learned about implementing data management policies, including why you should use lifecycle policies and how to automate policy management with infrastructure as code. Implement data management policies is part of the [Optimize systems](/well-architected-framework/optimize-systems) pillar.

To learn more about infrastructure and resource management, refer to the following resources:

- [Automate infrastructure provisioning](/well-architected-framework/define-and-automate-processes/process-automation/process-automation-workflow)
Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
---
page_title: Decommission resources
description: Learn how to decommission infrastructure components while maintaining system integrity and avoiding disruptions through proper planning and automation.
---

# Decommission resources

Resource decommissioning is the process of safely removing or deleting infrastructure components, applications, or services that are no longer needed or have reached end-of-life. You should remove unused or obsolete resources such as servers, databases, images, IAM entities, and other infrastructure components.

When you decommission unused resources, you gain the following benefits:

- Reduce costs by removing charges associated with unused resources.
- Minimize security risks by removing outdated or vulnerable resources that bad actors can exploit.
- Reduce configuration drift by only running necessary resources.
- Improve audit and compliance by maintaining a smaller infrastructure footprint.

To successfully decommission resources, you need a well-defined plan that includes dependency analysis, stakeholder communication, and a gradual removal process. Depending on whether you manage your infrastructure manually or with automation, you may need to adjust your decommissioning approach.

## Find resources to decommission

Before you begin decommissioning resources, identify which resources exist in your environment and determine which ones are candidates for removal. This discovery phase helps you avoid accidentally removing resources that are still in use and ensures you target the right components for decommissioning.

Start by creating an inventory of your infrastructure. Most cloud providers offer resource tagging and billing reports that help identify unused or underutilized resources. Pay particular attention to active resources created for temporary purposes, such as testing or proof-of-concepts.

Terraform tracks all infrastructure it manages in state files. You can use the `terraform state list` command to see all managed resources and `terraform show` to examine their current configurations. This list helps you identify which resources are still in use and which ones you can decommission.
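For example, a state listing looks like the following; the resource addresses here are illustrative, and your output reflects your own configuration:

```shell-session
$ terraform state list
aws_instance.web
aws_s3_bucket.logs
aws_s3_bucket.legacy_reports
```

You can then run `terraform state show <address>` to inspect a single resource in detail.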
25+
26+
If you're using HCP Terraform, you can use the [workspace explorer](/terraform/cloud-docs/workspaces/explorer) feature to gain visibility into the resThat ources your organization manages with Terraform. The explorer provides a visual representation of your infrastructure, making it easier to identify resources that you no longer need.
27+
28+
## Create a dependency plan
29+
30+
Your plan should analyze which services, applications, or other resources rely on the components you plan to remove. Your plan will lower the risk of unexpected outages by identifying and addressing dependencies before decommissioning.
31+
32+
If you are using infrastructure as code tools like Terraform, you can use a dependency graph to understand resource relationships. This graph can help you visualize connections between resources and identify potential impacts of removing specific components.
33+
34+
The following command creates a dependency graph of your Terraform resources:
35+
36+
```shell-session
37+
$ terraform graph -type=plan | dot -Tpng > graph.png
38+
```
39+
40+
<Note>
41+
42+
You need to install Graphviz on your system to use the `terraform graph` command and generate visualizations. For more information on installing Graphviz, refer to the [Graphviz installation guide](https://graphviz.org/download/).
43+
44+
</Note>
HashiCorp resources:

- [Terraform graph command](/terraform/cli/commands/graph)

## Create a communication plan

Your plan should outline how you will inform stakeholders about the decommissioning process, including timelines and potential impacts. Effective communication prevents surprises and ensures all affected teams can prepare for the changes.

Start by identifying all stakeholders who might be affected by the decommissioning, including development teams, operations staff, end users, and business owners. Create a notification timeline that provides adequate warning. Your communications should explain what resources you are removing, when the decommissioning will occur, and what actions stakeholders need to take.

## Create backups

Before decommissioning, confirm that you have backups of any critical data or configurations associated with the resources you are removing. Backups provide a safety net in case you need to roll back changes.

You may want to back up the following resources:

- Servers in the form of machine images
- Database snapshots
- Configuration files
- Metadata

Since Terraform uses infrastructure as code to manage resources, you can redeploy resources that you previously decommissioned by reapplying your Terraform configuration. This capability allows you to recover resources quickly if needed.

For example, if you backed up a server as a machine image, you can redeploy it by updating the AMI in your Terraform configuration with the backed-up AMI ID. In the following example, you change the `ami` attribute to the ID of your backed-up AMI:

```hcl
resource "aws_instance" "example" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
}
```

You can also use Terraform to create [AWS EBS snapshots](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/ebs_snapshot) before decommissioning instances. The following example creates a snapshot of an EBS volume:

```hcl
resource "aws_ebs_volume" "example" {
  availability_zone = "us-west-2a"
  size              = 40

  tags = {
    Name = "HelloWorld"
  }
}

resource "aws_ebs_snapshot" "example_snapshot" {
  volume_id = aws_ebs_volume.example.id

  tags = {
    Name = "HelloWorld_snap"
  }
}
```

## Gradually remove resources

Implement a phased approach to removing resources instead of removing everything at once. Start by redirecting traffic away from the resource, and monitor user traffic to ensure you don't negatively impact users.

You can use `terraform plan` to preview the changes that will occur when you remove resources from your configuration. This command helps you understand the impact of your changes before applying them.
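For instance, after you remove a resource block from your configuration, `terraform plan` reports the pending destruction. The following output is truncated and illustrative:

```shell-session
$ terraform plan

  # aws_instance.example will be destroyed
  - resource "aws_instance" "example" {
      - ami           = "ami-0c55b159cbfafe1f0" -> null
      - instance_type = "t2.micro" -> null
    }

Plan: 0 to add, 0 to change, 1 to destroy.
```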
You can also set safeguards so you only decommission resources when you are ready. You can use Terraform's `lifecycle` block with `prevent_destroy = true` to prevent accidental deletion of critical resources. The [lifecycle](/terraform/language/meta-arguments#lifecycle) setting ensures that you won't destroy the resource unless you explicitly remove the `prevent_destroy` attribute.

```hcl
resource "aws_instance" "example" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"

  lifecycle {
    prevent_destroy = true
  }
}
```
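With `prevent_destroy` set, a destroy attempt fails during planning. The following error output is illustrative of what Terraform returns:

```shell-session
$ terraform destroy

Error: Instance cannot be destroyed

  Resource aws_instance.example has lifecycle.prevent_destroy set, but a plan
  calls for this resource to be destroyed. To avoid this error and continue
  with the plan, either disable lifecycle.prevent_destroy or reduce the scope
  of the plan using the -target option.
```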
Consul can help you gradually remove resources by directing traffic away from services you are decommissioning. You can use Consul's service discovery and health checking features to monitor the status of services and ensure that dependent services are not affected during the decommissioning process.

If you are using orchestration tools like Nomad or Kubernetes, you can use their built-in capabilities to gracefully drain workloads before decommissioning nodes. Nomad provides node drain functionality through the `nomad node drain` command, which prevents scheduling new allocations on a node while safely migrating existing jobs to other available nodes. The Kubernetes `kubectl drain` command safely removes pods from nodes while respecting Pod Disruption Budgets, which ensure that a minimum number of application replicas remain available throughout the process.
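As a sketch, the drain commands look like the following; the node ID and node name are placeholders:

```shell-session
$ nomad node drain -enable -yes 1ac61f9a

$ kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data
```

Both commands block new work from landing on the node while existing workloads move elsewhere; decommission the node only after you confirm the drain has completed.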
HashiCorp resources:

- Review the [Zero-downtime deployments](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments) documentation for strategies on how to redirect traffic and disable functions gradually.
- Learn how to [manage resource lifecycles with Terraform](/terraform/tutorials/state/resource-lifecycle).
- [Get up and running with Nomad](/nomad/tutorials/get-started) by learning about scheduling, setting up a cluster, and deploying an example job.
- Learn the [fundamentals of Consul](/consul/tutorials).

## Verify health of infrastructure and applications

After the decommissioning process, verify that the remaining infrastructure and applications are functioning correctly. Monitor system performance and user feedback to ensure that there are no negative impacts.

You should do the following after you decommission the resources:

- Validate that APIs are functioning.
- Check application performance.
- Monitor system logs for errors.

HashiCorp resources:

- Learn to [set up monitoring agents](/well-architected-framework/define-and-automate-processes/monitor/setup-monitoring-agents) and [dashboards and alerts](/well-architected-framework/define-and-automate-processes/monitor/dashboards-alerts).

External resources:

- AWS: [Implement a decommissioning process](https://docs.aws.amazon.com/wellarchitected/latest/framework/cost_decomissioning_resources_implement_process.html)

## Next steps

In this section of Lifecycle management, you learned about decommissioning resources, including why you should plan decommissioning and how to safely execute the process. Decommission resources is part of the [Optimize systems](/well-architected-framework/optimize-systems) pillar.

To learn more about infrastructure and resource management, refer to the following resource:

- [Data management](/well-architected-framework/optimize-systems/lifecycle-management/data-management)
