45 changes: 33 additions & 12 deletions playbook/README.md
@@ -33,6 +33,19 @@ Supports enabling `etcd Overload Protection` and `APF Flow Control` [APF Rate Li
| `inject-stress-list-qps` | `int` | "100" | QPS per stress test Pod |
| `inject-stress-total-duration` | `string` | "30s" | Total test duration (e.g. 30s, 5m) |

**Recommended Parameters for TKE Clusters**

| Cluster Level | resource-create-object-size-bytes | resource-create-object-count | resource-create-qps | inject-stress-concurrency | inject-stress-list-qps |
|---------|----------------------------------|-----------------------------|---------------------|--------------------------|-----------------------|
| L5 | 10000 | 100 | 10 | 6 | 200 |
| L50 | 10000 | 300 | 10 | 6 | 200 |
| L100 | 50000 | 500 | 20 | 6 | 200 |
| L200 | 100000 | 1000 | 50 | 9 | 200 |
| L500 | 100000 | 1000 | 50 | 12 | 200 |
| L1000 | 100000 | 3000 | 50 | 12 | 300 |
| L3000 | 100000 | 6000 | 500 | 18 | 500 |
| L5000 | 100000 | 10000 | 500 | 21 | 500 |
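
The table can also be kept as a small lookup so the values are not retyped for each run. The following is a minimal sketch, not part of the playbook: the function names are hypothetical, and the `-p name=value` form assumes the workflow is submitted with the Argo CLI (`argo submit`) rather than `kubectl create -f`.

```python
from typing import Dict, List

# Column order matches the recommendation table above.
PARAM_NAMES = (
    "resource-create-object-size-bytes",
    "resource-create-object-count",
    "resource-create-qps",
    "inject-stress-concurrency",
    "inject-stress-list-qps",
)

# Values copied row by row from the table.
_ROWS = {
    "L5": (10000, 100, 10, 6, 200),
    "L50": (10000, 300, 10, 6, 200),
    "L100": (50000, 500, 20, 6, 200),
    "L200": (100000, 1000, 50, 9, 200),
    "L500": (100000, 1000, 50, 12, 200),
    "L1000": (100000, 3000, 50, 12, 300),
    "L3000": (100000, 6000, 500, 18, 500),
    "L5000": (100000, 10000, 500, 21, 500),
}


def recommended_params(level: str) -> Dict[str, int]:
    """Return the recommended parameter values for a cluster level."""
    if level not in _ROWS:
        raise ValueError(f"unknown cluster level: {level!r}")
    return dict(zip(PARAM_NAMES, _ROWS[level]))


def as_cli_overrides(params: Dict[str, int]) -> List[str]:
    """Format parameters as `-p name=value` pairs (assumes the Argo CLI)."""
    args: List[str] = []
    for name, value in params.items():
        args += ["-p", f"{name}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(as_cli_overrides(recommended_params("L200"))))
```

Unknown levels raise `ValueError` instead of falling back to a neighboring row, since the table gives no guidance for sizes in between.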

**etcd Overload Protection & Enhanced APF**

The Tencent Cloud TKE team has developed these core protection features:
@@ -56,31 +69,39 @@ Supported versions:
**playbook**: `workflow/coredns-disruption-scenario.yaml`

This scenario simulates coredns service disruption by:
1. Scaling coredns Deployment replicas to 0
2. Maintaining zero replicas for specified duration
3. Restoring original replica count

1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster, which confirms the cluster is available for testing

2. **Component Shutdown**: Log in to the Argo Web UI, open the `coredns-disruption-scenario` workflow, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the coredns Deployment down to 0 replicas

3. **Service Validation**: While coredns is down, check whether your services are affected

4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the coredns Deployment replicas
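
For the service validation step, one option is to probe name resolution from inside a Pod while the workflow is suspended. The sketch below is illustrative only: `probe_dns` and `failure_rate` are hypothetical helpers, and the hostname in the example assumes the default `cluster.local` cluster domain. Node-local or application-level DNS caches may hide an outage, so a low failure rate is not proof that a service is unaffected.

```python
import socket
import time
from typing import Callable, List


def probe_dns(
    hostname: str,
    attempts: int = 5,
    interval_s: float = 1.0,
    resolve: Callable = socket.getaddrinfo,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bool]:
    """Resolve `hostname` repeatedly; True means the lookup succeeded."""
    results: List[bool] = []
    for i in range(attempts):
        try:
            resolve(hostname, None)
            results.append(True)
        except OSError:  # socket.gaierror is a subclass of OSError
            results.append(False)
        if i < attempts - 1:
            sleep(interval_s)
    return results


def failure_rate(results: List[bool]) -> float:
    """Fraction of failed lookups; 0.0 for an empty result list."""
    return results.count(False) / len(results) if results else 0.0


if __name__ == "__main__":
    # Assumed in-cluster name; replace with a name your service depends on.
    outcome = probe_dns("kubernetes.default.svc.cluster.local")
    print(f"failure rate: {failure_rate(outcome):.0%}")
```

Running the probe once before clicking `RESUME` on `suspend-1` gives a baseline to compare against during the disruption.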

**Parameters**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
| `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name |
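
The `disruption-duration` examples (`30s`, `5m`) look like Go-style duration strings. As a rough sketch for sanity-checking a value before submitting the workflow, the hypothetical helper below converts such a string to seconds. It assumes only the `s`, `m`, and `h` units; the parser the workflow actually uses is not shown here and may accept more.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_PART = re.compile(r"(\d+)([smh])")
_WHOLE = re.compile(r"(?:\d+[smh])+")


def duration_to_seconds(text: str) -> int:
    """Convert strings such as '30s', '5m', or '1m30s' to seconds."""
    if not _WHOLE.fullmatch(text):
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in _PART.findall(text))


if __name__ == "__main__":
    for value in ("30s", "5m", "1m30s"):
        print(value, "=", duration_to_seconds(value), "seconds")
```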

## kubernetes-proxy Disruption

**playbook**: `workflow/kubernetes-proxy-disruption-scenario.yaml`

This scenario simulates kubernetes-proxy service disruption by:
1. Scaling kubernetes-proxy Deployment replicas to 0
2. Maintaining zero replicas for specified duration
3. Restoring original replica count

1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster, which confirms the cluster is available for testing

2. **Component Shutdown**: Log in to the Argo Web UI, open the `kubernetes-proxy-disruption-scenario` workflow, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the kubernetes-proxy Deployment down to 0 replicas

3. **Service Validation**: While kubernetes-proxy is down, check whether your services are affected

4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the kubernetes-proxy Deployment replicas
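
The pre-check in step 1 can also be run by hand before opening the Argo Web UI. The following is a minimal sketch that wraps `kubectl get configmap`. It assumes `tke-chaos-test/tke-chaos-precheck-resource` means namespace `tke-chaos-test` and ConfigMap name `tke-chaos-precheck-resource`, and that `kubectl` is on the `PATH`. The function names are hypothetical and not part of the playbook.

```python
import subprocess
from typing import Callable, List, Optional

# Assumed split of "tke-chaos-test/tke-chaos-precheck-resource" into namespace/name.
PRECHECK_NAMESPACE = "tke-chaos-test"
PRECHECK_CONFIGMAP = "tke-chaos-precheck-resource"


def precheck_command(kubeconfig: Optional[str] = None) -> List[str]:
    """Build the kubectl argv that looks up the pre-check ConfigMap."""
    cmd = ["kubectl"]
    if kubeconfig:
        cmd += ["--kubeconfig", kubeconfig]
    cmd += ["get", "configmap", PRECHECK_CONFIGMAP, "-n", PRECHECK_NAMESPACE]
    return cmd


def precheck_passes(
    kubeconfig: Optional[str] = None,
    run: Callable = subprocess.run,
) -> bool:
    """Return True when kubectl exits with status 0 (the ConfigMap was found)."""
    completed = run(precheck_command(kubeconfig), capture_output=True, text=True)
    return completed.returncode == 0


if __name__ == "__main__":
    print(" ".join(precheck_command()))
```

The `run` argument exists only so the check can be exercised without a cluster; in normal use the default `subprocess.run` is enough.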

**Parameters**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
| `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name |

## Namespace Deletion Protection
@@ -140,10 +161,10 @@ kubectl create -f workflow/managed-cluster-master-component/restore-apiserver.ya

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `region` | `string` | `<REGION>` | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/document/product/213/6091?lang=en&pg=) |
| `secret-id` | `string` | `<SECRET_ID>` | Tencent Cloud API secret ID, obtain from [API Key Management](https://console.cloud.tencent.com/cam/capi) |
| `secret-key` | `string` | `<SECRET_KEY>` | Tencent Cloud API secret key |
| `cluster-id` | `string` | `<CLUSTER_ID>` | Target cluster ID |
| `region` | `string` | "" | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/document/product/213/6091?lang=en&pg=) |
| `secret-id` | `string` | "" | Tencent Cloud API secret ID, obtain from [API Key Management](https://console.cloud.tencent.com/cam/capi) |
| `secret-key` | `string` | "" | Tencent Cloud API secret key |
| `cluster-id` | `string` | "" | Target cluster ID |
| `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Secret name containing target cluster kubeconfig |
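
Since `region`, `secret-id`, `secret-key`, and `cluster-id` have no usable default, it may help to check them before creating the workflow. The sketch below is one possible check, with a hypothetical function name. It treats empty strings and `<PLACEHOLDER>`-style values as missing.

```python
import re
from typing import Dict, List

REQUIRED = ("region", "secret-id", "secret-key", "cluster-id")
_PLACEHOLDER = re.compile(r"<[A-Z_]+>")


def missing_required(params: Dict[str, str]) -> List[str]:
    """List required parameters that are empty, absent, or still a placeholder."""
    missing = []
    for name in REQUIRED:
        value = params.get(name, "").strip()
        if not value or _PLACEHOLDER.fullmatch(value):
            missing.append(name)
    return missing


if __name__ == "__main__":
    example = {"region": "ap-guangzhou", "secret-id": "", "cluster-id": "<CLUSTER_ID>"}
    print("missing:", missing_required(example))
```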

**Notes**
39 changes: 27 additions & 12 deletions playbook/README_zh.md
@@ -33,6 +33,19 @@
| `inject-stress-list-qps` | `int` | "100" | `QPS` of each stress-test `Pod` |
| `inject-stress-total-duration` | `string` | "30s" | Total stress-test duration (e.g. 30s, 5m) |

**Recommended Stress-Test Parameters for TKE Clusters**

| Cluster Level | resource-create-object-size-bytes | resource-create-object-count | resource-create-qps | inject-stress-concurrency | inject-stress-list-qps |
|---------|----------------------------------|-----------------------------|---------------------|--------------------------|-----------------------|
| L5 | 10000 | 100 | 10 | 6 | 200 |
| L50 | 10000 | 300 | 10 | 6 | 200 |
| L100 | 50000 | 500 | 20 | 6 | 200 |
| L200 | 100000 | 1000 | 50 | 9 | 200 |
| L500 | 100000 | 1000 | 50 | 12 | 200 |
| L1000 | 100000 | 3000 | 50 | 12 | 300 |
| L3000 | 100000 | 6000 | 500 | 18 | 500 |
| L5000 | 100000 | 10000 | 500 | 21 | 500 |

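**etcd Overload Protection & Enhanced APF Rate Limiting**

Building on the community version, the Tencent Cloud TKE team has developed the following core protection features: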
@@ -56,31 +69,33 @@
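**playbook**: `workflow/coredns-disruption-scenario.yaml`

This scenario simulates a `coredns` service disruption as follows:
1. Scale the `coredns Deployment` replicas down to `0`
2. Keep the replica count at `0` for the specified duration
3. Restore the original replica count

1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource ConfigMap` exists in the target cluster, ensuring the cluster is available for testing
2. **Component Shutdown**: Log in to the Argo Web UI, click `coredns-disruption-scenario workflow`, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the `coredns Deployment` replicas down to `0`
3. **Service Validation**: While `coredns` is down, you can verify whether your services are affected by the `coredns` outage
4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the `coredns Deployment` replicas

**Parameters**

| Parameter | Type | Default | Description |
|---------|------|--------|------|
| `disruption-duration` | `string` | `30s` | Service disruption duration (e.g. 30s, 5m) |
| `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Name of the target cluster kubeconfig secret; if empty, the current cluster is tested |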

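## kubernetes-proxy Disruption

**playbook**: `workflow/kubernetes-proxy-disruption-scenario.yaml`

This scenario simulates a `kubernetes-proxy` service disruption as follows:
1. Scale the `kubernetes-proxy` `Deployment` replicas down to 0
2. Keep the replica count at `0` for the specified duration
3. Restore the original replica count

1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource ConfigMap` exists in the target cluster, ensuring the cluster is available for testing
2. **Component Shutdown**: Log in to the Argo Web UI, click `kubernetes-proxy-disruption-scenario workflow`, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the `kubernetes-proxy Deployment` replicas down to `0`
3. **Service Validation**: While `kubernetes-proxy` is down, you can verify whether your services are affected by the `kubernetes-proxy` outage
4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the `kubernetes-proxy Deployment` replicas

**Parameters**

| Parameter | Type | Default | Description |
|---------|------|--------|------|
| `disruption-duration` | `string` | `30s` | Service disruption duration (e.g. 30s, 5m) |
| `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Name of the target cluster kubeconfig secret; if empty, the current cluster is tested |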

## Namespace Deletion Protection
@@ -139,10 +154,10 @@ kubectl create -f workflow/managed-cluster-master-component/restore-apiserver.ya

| Parameter | Type | Default | Description |
|---------|------|--------|------|
| `region` | `string` | `<REGION>` | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/zh/document/product/213/6091) |
| `secret-id` | `string` | `<SECRET_ID>` | Tencent Cloud API secret ID; obtain it in the console under [API Key Management](https://console.cloud.tencent.com/cam/capi) |
| `secret-key` | `string` | `<SECRET_KEY>` | Tencent Cloud API secret key |
| `cluster-id` | `string` | `<CLUSTER_ID>` | ID of the cluster under test |
| `region` | `string` | "" | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/zh/document/product/213/6091) |
| `secret-id` | `string` | "" | Tencent Cloud API secret ID; obtain it in the console under [API Key Management](https://console.cloud.tencent.com/cam/capi) |
| `secret-key` | `string` | "" | Tencent Cloud API secret key |
| `cluster-id` | `string` | "" | ID of the cluster under test |
| `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Name of the target cluster kubeconfig secret |

**Notes**