What happened:
In an AKS cluster, I attempted to migrate a Pod (managed by a StatefulSet) from Node aks-automq-19985432-vmss00000r to Node aks-automq-19985432-vmss00000p by cordoning the original node and deleting the Pod.
Because the Pod is managed by a StatefulSet, its PVC and the underlying Azure Disk had to be detached from the original node and reattached to the new one. However, after the Pod was deleted, it consistently failed to start on the new node with the error:
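For reference, the migration step itself was just the standard cordon-and-delete sequence. A minimal sketch (node/pod names taken from this report; the commands are wrapped in a function only so the kubectl binary can be overridden, e.g. for a dry run):

```shell
# Sketch of the migration described above; names are taken from this report.
# KUBECTL may be overridden (e.g. KUBECTL=echo) to dry-run the commands.
NODE_FROM=aks-automq-19985432-vmss00000r
POD=kf-2989ir2rmxntrw8t-automq-enterprise-broker-0
NS=kf-2989ir2rmxntrw8t

migrate_pod() {
  # Prevent new pods from being scheduled onto the original node.
  ${KUBECTL:-kubectl} cordon "$NODE_FROM"
  # Delete the pod; the StatefulSet controller recreates it on another node.
  ${KUBECTL:-kubectl} delete pod "$POD" -n "$NS"
}
```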
2025-03-28T07:26:01Z FailedMapVolume Warning MapVolume.MapPodDevice failed for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Full Pod Events:
2025-03-28T07:00:18Z Killing Normal Stopping container automq-broker
<nil> FailedScheduling Warning 0/10 nodes are available: 1 node(s) were unschedulable, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 7 Insufficient cpu, 7 Insufficient memory.
<nil> FailedScheduling Warning 0/10 nodes are available: 1 node(s) were unschedulable, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 7 Insufficient cpu, 7 Insufficient memory. preemption: not eligible due to a terminating pod on the nominated node.
<nil> Scheduled Normal Successfully assigned kf-2989ir2rmxntrw8t/kf-2989ir2rmxntrw8t-automq-enterprise-broker-0 to aks-automq-19985432-vmss00000r
2025-03-28T07:00:22Z FailedAttachVolume Warning Multi-Attach error for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15" Volume is already exclusively attached to one node and can't be attached to another
2025-03-28T07:00:55Z SuccessfulAttachVolume Normal AttachVolume.Attach succeeded for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15"
2025-03-28T07:00:58Z SuccessfulMountVolume Normal MapVolume.MapPodDevice succeeded for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/dev"
2025-03-28T07:00:58Z SuccessfulMountVolume Normal MapVolume.MapPodDevice succeeded for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15" volumeMapPath "/var/lib/kubelet/pods/8bff5466-adaf-46a6-8856-67bb68da28ab/volumeDevices/kubernetes.io~csi"
2025-03-28T07:00:58Z Pulling Normal Pulling image "automq-docker-registry-registry.cn-hangzhou.cr.aliyuncs.com/automq/automq-enterprise:temp-ny-test-20250321085849"
2025-03-28T07:00:59Z Pulled Normal Successfully pulled image "automq-docker-registry-registry.cn-hangzhou.cr.aliyuncs.com/automq/automq-enterprise:temp-ny-test-20250321085849" in 1.314s (1.314s including waiting). Image size: 814724362 bytes.
2025-03-28T07:00:59Z Created Normal Created container automq-broker
2025-03-28T07:00:59Z Started Normal Started container automq-broker
2025-03-28T07:20:59Z Killing Normal Stopping container automq-broker
<nil> FailedScheduling Warning 0/10 nodes are available: 1 node(s) were unschedulable, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 7 Insufficient cpu, 7 Insufficient memory.
<nil> FailedScheduling Warning 0/10 nodes are available: 1 node(s) were unschedulable, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 7 Insufficient cpu, 7 Insufficient memory. preemption: not eligible due to a terminating pod on the nominated node.
<nil> Scheduled Normal Successfully assigned kf-2989ir2rmxntrw8t/kf-2989ir2rmxntrw8t-automq-enterprise-broker-0 to aks-automq-19985432-vmss00000p
2025-03-28T07:21:03Z FailedAttachVolume Warning Multi-Attach error for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15" Volume is already exclusively attached to one node and can't be attached to another
2025-03-28T07:22:44Z SuccessfulAttachVolume Normal AttachVolume.Attach succeeded for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15"
2025-03-28T07:22:47Z SuccessfulMountVolume Normal MapVolume.MapPodDevice succeeded for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/dev"
2025-03-28T07:22:47Z SuccessfulMountVolume Normal MapVolume.MapPodDevice succeeded for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15" volumeMapPath "/var/lib/kubelet/pods/b96af6d8-c343-4351-8512-ead7fa60e21e/volumeDevices/kubernetes.io~csi"
2025-03-28T07:22:47Z Pulling Normal Pulling image "automq-docker-registry-registry.cn-hangzhou.cr.aliyuncs.com/automq/automq-enterprise:temp-ny-test-20250321085849"
2025-03-28T07:22:48Z Pulled Normal Successfully pulled image "automq-docker-registry-registry.cn-hangzhou.cr.aliyuncs.com/automq/automq-enterprise:temp-ny-test-20250321085849" in 1.295s (1.295s including waiting). Image size: 814724362 bytes.
2025-03-28T07:22:48Z Created Normal Created container automq-broker
2025-03-28T07:22:48Z Started Normal Started container automq-broker
2025-03-28T07:23:01Z Killing Normal Stopping container automq-broker
2025-03-28T07:23:01Z Unhealthy Warning Readiness probe failed: dial tcp 10.224.1.3:9102: connect: connection refused
<nil> Scheduled Normal Successfully assigned kf-2989ir2rmxntrw8t/kf-2989ir2rmxntrw8t-automq-enterprise-broker-0 to aks-automq-19985432-vmss00000p
2025-03-28T07:23:05Z FailedMapVolume Warning MapVolume.WaitForAttach failed for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15" : volume attachment is being deleted
<nil> Scheduled Normal Successfully assigned kf-2989ir2rmxntrw8t/kf-2989ir2rmxntrw8t-automq-enterprise-broker-0 to aks-automq-19985432-vmss00000p
2025-03-28T07:23:09Z FailedMapVolume Warning MapVolume.WaitForAttach failed for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15" : volume attachment is being deleted
2025-03-28T07:23:37Z FailedMapVolume Warning MapVolume.WaitForAttach failed for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15" : volume /subscriptions/218357d0-eaaf-4e3e-9ffa-6b4ccb7e2df9/resourceGroups/MC_AutoMQ_qiuzhang-k8s-1_koreacentral/providers/Microsoft.Compute/disks/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15 has GET error for volume attachment csi-799453fa286b826fc188cb5d02829284a6c28369f7dba99a7cc11e71dcbc6fdc: volumeattachments.storage.k8s.io "csi-799453fa286b826fc188cb5d02829284a6c28369f7dba99a7cc11e71dcbc6fdc" is forbidden: User "system:node:aks-automq-19985432-vmss00000p" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'aks-automq-19985432-vmss00000p' and this object
Azure does not have opinion for this user.
2025-03-28T07:23:45Z FailedMapVolume Warning MapVolume.WaitForAttach failed for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15" : watch error:unknown (get volumeattachments.storage.k8s.io) for volume /subscriptions/218357d0-eaaf-4e3e-9ffa-6b4ccb7e2df9/resourceGroups/MC_AutoMQ_qiuzhang-k8s-1_koreacentral/providers/Microsoft.Compute/disks/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15
2025-03-28T07:23:54Z SuccessfulAttachVolume Normal AttachVolume.Attach succeeded for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15"
<nil> Scheduled Normal Successfully assigned kf-2989ir2rmxntrw8t/kf-2989ir2rmxntrw8t-automq-enterprise-broker-0 to aks-automq-19985432-vmss00000p
2025-03-28T07:25:52Z SuccessfulAttachVolume Normal AttachVolume.Attach succeeded for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15"
2025-03-28T07:26:01Z FailedMapVolume Warning MapVolume.MapPodDevice failed for volume "pvc-5be5c28c-a881-431d-b1ac-6d291d50da15" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
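The "no relationship found between node ... and this object" error above points at a stale VolumeAttachment. A hedged sketch of how the attachment state for this volume can be inspected (volume name taken from this report; the function wrapper exists only so kubectl can be overridden for a dry run):

```shell
PV=pvc-5be5c28c-a881-431d-b1ac-6d291d50da15

list_attachments() {
  # Show every VolumeAttachment, its PV, the node it targets, and its status.
  ${KUBECTL:-kubectl} get volumeattachment \
    -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached
}
```

Piping the output through `grep "$PV"` then narrows it down to this disk.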
Logs from the azuredisk container of kube-system/csi-azuredisk-node-tq77m (running on Node aks-automq-19985432-vmss00000p):
I0328 07:22:44.854785 1 utils.go:105] GRPC call: /csi.v1.Node/NodeStageVolume
I0328 07:22:44.854811 1 utils.go:106] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/staging/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15","volume_capability":{"AccessType":{"Block":{}},"access_mode":{"mode":7}},"volume_context":{"cachingmode":"None","csi.storage.k8s.io/pv/name":"pvc-5be5c28c-a881-431d-b1ac-6d291d50da15","csi.storage.k8s.io/pvc/name":"kf-2989ir2rmxntrw8t-automq-enterprise-broker-data-kf-2989ir2rmxntrw8t-automq-enterprise-broker-0","csi.storage.k8s.io/pvc/namespace":"kf-2989ir2rmxntrw8t","requestedsizegib":"20","skuName":"PremiumV2_LRS","storage.kubernetes.io/csiProvisionerIdentity":"1742980023812-2230-disk.csi.azure.com"},"volume_id":"/subscriptions/218357d0-eaaf-4e3e-9ffa-6b4ccb7e2df9/resourceGroups/MC_AutoMQ_qiuzhang-k8s-1_koreacentral/providers/Microsoft.Compute/disks/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15"}
I0328 07:22:45.979114 1 azure_common_linux.go:207] azureDisk - found /dev/disk/azure/scsi1/lun0 by sdb under /dev/disk/azure/scsi1/
I0328 07:22:45.979154 1 utils.go:112] GRPC response: {}
I0328 07:22:45.986796 1 utils.go:105] GRPC call: /csi.v1.Node/NodePublishVolume
I0328 07:22:45.986809 1 utils.go:106] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/staging/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15","target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/b96af6d8-c343-4351-8512-ead7fa60e21e","volume_capability":{"AccessType":{"Block":{}},"access_mode":{"mode":7}},"volume_context":{"cachingmode":"None","csi.storage.k8s.io/pv/name":"pvc-5be5c28c-a881-431d-b1ac-6d291d50da15","csi.storage.k8s.io/pvc/name":"kf-2989ir2rmxntrw8t-automq-enterprise-broker-data-kf-2989ir2rmxntrw8t-automq-enterprise-broker-0","csi.storage.k8s.io/pvc/namespace":"kf-2989ir2rmxntrw8t","requestedsizegib":"20","skuName":"PremiumV2_LRS","storage.kubernetes.io/csiProvisionerIdentity":"1742980023812-2230-disk.csi.azure.com"},"volume_id":"/subscriptions/218357d0-eaaf-4e3e-9ffa-6b4ccb7e2df9/resourceGroups/MC_AutoMQ_qiuzhang-k8s-1_koreacentral/providers/Microsoft.Compute/disks/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15"}
I0328 07:22:47.071063 1 azure_common_linux.go:207] azureDisk - found /dev/disk/azure/scsi1/lun0 by sdb under /dev/disk/azure/scsi1/
I0328 07:22:47.071096 1 nodeserver.go:265] NodePublishVolume [block]: found device path /dev/disk/azure/scsi1/lun0 with lun 0
I0328 07:22:47.071317 1 nodeserver.go:646] ensureBlockTargetFile [block]: making target file /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/b96af6d8-c343-4351-8512-ead7fa60e21e
I0328 07:22:47.071383 1 nodeserver.go:280] NodePublishVolume: mounting /dev/disk/azure/scsi1/lun0 at /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/b96af6d8-c343-4351-8512-ead7fa60e21e
I0328 07:22:47.071402 1 mount_linux.go:218] Mounting cmd (mount) with arguments ( -o bind /dev/disk/azure/scsi1/lun0 /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/b96af6d8-c343-4351-8512-ead7fa60e21e)
I0328 07:22:47.072342 1 mount_linux.go:218] Mounting cmd (mount) with arguments ( -o bind,remount /dev/disk/azure/scsi1/lun0 /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/b96af6d8-c343-4351-8512-ead7fa60e21e)
I0328 07:22:47.073079 1 nodeserver.go:285] NodePublishVolume: mount /dev/disk/azure/scsi1/lun0 at /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/b96af6d8-c343-4351-8512-ead7fa60e21e successfully
I0328 07:22:47.073099 1 utils.go:112] GRPC response: {}
I0328 07:23:29.814143 1 utils.go:105] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0328 07:23:29.814166 1 utils.go:106] GRPC request: {"target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/b96af6d8-c343-4351-8512-ead7fa60e21e","volume_id":"/subscriptions/218357d0-eaaf-4e3e-9ffa-6b4ccb7e2df9/resourceGroups/MC_AutoMQ_qiuzhang-k8s-1_koreacentral/providers/Microsoft.Compute/disks/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15"}
I0328 07:23:29.814209 1 nodeserver.go:302] NodeUnpublishVolume: unmounting volume /subscriptions/218357d0-eaaf-4e3e-9ffa-6b4ccb7e2df9/resourceGroups/MC_AutoMQ_qiuzhang-k8s-1_koreacentral/providers/Microsoft.Compute/disks/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15 on /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/b96af6d8-c343-4351-8512-ead7fa60e21e
I0328 07:23:29.814229 1 mount_helper_common.go:93] unmounting "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/b96af6d8-c343-4351-8512-ead7fa60e21e" (corruptedMount: false, mounterCanSkipMountPointChecks: true)
I0328 07:23:29.814240 1 mount_linux.go:360] Unmounting /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/b96af6d8-c343-4351-8512-ead7fa60e21e
I0328 07:23:29.816145 1 mount_helper_common.go:150] Deleting path "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/b96af6d8-c343-4351-8512-ead7fa60e21e"
I0328 07:23:29.816323 1 nodeserver.go:312] NodeUnpublishVolume: unmount volume /subscriptions/218357d0-eaaf-4e3e-9ffa-6b4ccb7e2df9/resourceGroups/MC_AutoMQ_qiuzhang-k8s-1_koreacentral/providers/Microsoft.Compute/disks/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15 on /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/b96af6d8-c343-4351-8512-ead7fa60e21e successfully
I0328 07:23:29.816346 1 utils.go:112] GRPC response: {}
I0328 07:24:01.738561 1 utils.go:105] GRPC call: /csi.v1.Node/NodePublishVolume
I0328 07:24:01.738577 1 utils.go:106] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/staging/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15","target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/288651ac-d154-4156-9182-00f28451ec6c","volume_capability":{"AccessType":{"Block":{}},"access_mode":{"mode":7}},"volume_context":{"cachingmode":"None","csi.storage.k8s.io/pv/name":"pvc-5be5c28c-a881-431d-b1ac-6d291d50da15","csi.storage.k8s.io/pvc/name":"kf-2989ir2rmxntrw8t-automq-enterprise-broker-data-kf-2989ir2rmxntrw8t-automq-enterprise-broker-0","csi.storage.k8s.io/pvc/namespace":"kf-2989ir2rmxntrw8t","requestedsizegib":"20","skuName":"PremiumV2_LRS","storage.kubernetes.io/csiProvisionerIdentity":"1742980023812-2230-disk.csi.azure.com"},"volume_id":"/subscriptions/218357d0-eaaf-4e3e-9ffa-6b4ccb7e2df9/resourceGroups/MC_AutoMQ_qiuzhang-k8s-1_koreacentral/providers/Microsoft.Compute/disks/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15"}
I0328 07:26:01.804609 1 utils.go:105] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0328 07:26:01.804631 1 utils.go:106] GRPC request: {"target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/288651ac-d154-4156-9182-00f28451ec6c","volume_id":"/subscriptions/218357d0-eaaf-4e3e-9ffa-6b4ccb7e2df9/resourceGroups/MC_AutoMQ_qiuzhang-k8s-1_koreacentral/providers/Microsoft.Compute/disks/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15"}
I0328 07:26:01.804672 1 nodeserver.go:302] NodeUnpublishVolume: unmounting volume /subscriptions/218357d0-eaaf-4e3e-9ffa-6b4ccb7e2df9/resourceGroups/MC_AutoMQ_qiuzhang-k8s-1_koreacentral/providers/Microsoft.Compute/disks/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15 on /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/288651ac-d154-4156-9182-00f28451ec6c
W0328 07:26:01.804711 1 mount_helper_common.go:34] Warning: mount cleanup skipped because path does not exist: /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/288651ac-d154-4156-9182-00f28451ec6c
I0328 07:26:01.804721 1 nodeserver.go:312] NodeUnpublishVolume: unmount volume /subscriptions/218357d0-eaaf-4e3e-9ffa-6b4ccb7e2df9/resourceGroups/MC_AutoMQ_qiuzhang-k8s-1_koreacentral/providers/Microsoft.Compute/disks/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15 on /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/288651ac-d154-4156-9182-00f28451ec6c successfully
I0328 07:26:01.804738 1 utils.go:112] GRPC response: {}
I0328 07:26:01.928465 1 utils.go:105] GRPC call: /csi.v1.Node/NodePublishVolume
I0328 07:26:01.928479 1 utils.go:106] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/staging/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15","target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15/16030b1f-abd5-40cd-98df-bd26424fac25","volume_capability":{"AccessType":{"Block":{}},"access_mode":{"mode":7}},"volume_context":{"cachingmode":"None","csi.storage.k8s.io/pv/name":"pvc-5be5c28c-a881-431d-b1ac-6d291d50da15","csi.storage.k8s.io/pvc/name":"kf-2989ir2rmxntrw8t-automq-enterprise-broker-data-kf-2989ir2rmxntrw8t-automq-enterprise-broker-0","csi.storage.k8s.io/pvc/namespace":"kf-2989ir2rmxntrw8t","requestedsizegib":"20","skuName":"PremiumV2_LRS","storage.kubernetes.io/csiProvisionerIdentity":"1742980023812-2230-disk.csi.azure.com"},"volume_id":"/subscriptions/218357d0-eaaf-4e3e-9ffa-6b4ccb7e2df9/resourceGroups/MC_AutoMQ_qiuzhang-k8s-1_koreacentral/providers/Microsoft.Compute/disks/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15"}
Kernel logs from Node aks-automq-19985432-vmss00000p:
[Fri Mar 28 07:21:10 2025] scsi 1:0:0:0: Direct-Access Msft Virtual Disk 1.0 PQ: 0 ANSI: 5
[Fri Mar 28 07:21:10 2025] sd 1:0:0:0: Attached scsi generic sg2 type 0
[Fri Mar 28 07:21:10 2025] sd 1:0:0:0: [sdb] 5242880 4096-byte logical blocks: (21.5 GB/20.0 GiB)
[Fri Mar 28 07:21:10 2025] sd 1:0:0:0: [sdb] Write Protect is off
[Fri Mar 28 07:21:10 2025] sd 1:0:0:0: [sdb] Mode Sense: 0f 00 10 00
[Fri Mar 28 07:21:10 2025] sd 1:0:0:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
[Fri Mar 28 07:21:10 2025] sd 1:0:0:0: [sdb] Attached SCSI disk
[Fri Mar 28 07:22:46 2025] loop0: detected capacity change from 0 to 41943040
[Fri Mar 28 07:22:46 2025] IPv6: ADDRCONF(NETDEV_CHANGE): azvfa1bc3b9c582: link becomes ready
[Fri Mar 28 07:22:46 2025] IPv6: ADDRCONF(NETDEV_CHANGE): azvfa1bc3b9c58: link becomes ready
[Fri Mar 28 07:22:46 2025] eth0: renamed from azvfa1bc3b9c582
[Fri Mar 28 07:23:02 2025] hv_storvsc f8b3781b-1e82-4818-a1c3-63d806ec15bb: tag#59 cmd 0x28 status: scsi 0x2 srb 0xa hv 0xc0000001
[Fri Mar 28 07:23:02 2025] hv_storvsc f8b3781b-1e82-4818-a1c3-63d806ec15bb: tag#59 cmd 0x3 status: scsi 0x0 srb 0x20 hv 0xc0000001
[Fri Mar 28 07:23:27 2025] sd 1:0:0:0: Device offlined - not ready after error recovery
[Fri Mar 28 07:23:27 2025] sd 1:0:0:0: rejecting I/O to offline device
[Fri Mar 28 07:23:27 2025] blk_update_request: I/O error, dev sdb, sector 2432888 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
[Fri Mar 28 07:24:01 2025] scsi: waiting for bus probes to complete ...
[Fri Mar 28 07:26:15 2025] INFO: task azurediskplugin:6053 blocked for more than 120 seconds.
[Fri Mar 28 07:26:16 2025] Not tainted 5.15.0-1081-azure #90-Ubuntu
[Fri Mar 28 07:26:16 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar 28 07:26:16 2025] task:azurediskplugin state:D stack: 0 pid: 6053 ppid: 4348 flags:0x00000000
[Fri Mar 28 07:26:16 2025] Call Trace:
[Fri Mar 28 07:26:16 2025] <TASK>
[Fri Mar 28 07:26:16 2025] __schedule+0x243/0x5f0
[Fri Mar 28 07:26:16 2025] schedule+0x69/0x110
[Fri Mar 28 07:26:16 2025] schedule_timeout+0x105/0x140
[Fri Mar 28 07:26:16 2025] __wait_for_common+0xa8/0x150
[Fri Mar 28 07:26:16 2025] ? usleep_range_state+0x90/0x90
[Fri Mar 28 07:26:16 2025] wait_for_completion+0x24/0x30
[Fri Mar 28 07:26:16 2025] scsi_complete_async_scans+0x102/0x160
[Fri Mar 28 07:26:16 2025] scsi_scan_host_selected+0xb3/0x120
[Fri Mar 28 07:26:16 2025] store_scan+0x142/0x150
[Fri Mar 28 07:26:16 2025] dev_attr_store+0x17/0x30
[Fri Mar 28 07:26:16 2025] sysfs_kf_write+0x3e/0x50
[Fri Mar 28 07:26:16 2025] kernfs_fop_write_iter+0x13b/0x1c0
[Fri Mar 28 07:26:16 2025] new_sync_write+0x114/0x1a0
[Fri Mar 28 07:26:16 2025] ? generic_add_lease+0x200/0x390
[Fri Mar 28 07:26:16 2025] vfs_write+0x1d8/0x270
[Fri Mar 28 07:26:16 2025] ksys_write+0x67/0xf0
[Fri Mar 28 07:26:16 2025] __x64_sys_write+0x19/0x20
[Fri Mar 28 07:26:16 2025] x64_sys_call+0x47c/0x1fa0
[Fri Mar 28 07:26:16 2025] do_syscall_64+0x56/0xb0
[Fri Mar 28 07:26:16 2025] ? srso_alias_return_thunk+0x5/0x7f
[Fri Mar 28 07:26:16 2025] ? syscall_exit_to_user_mode+0x1e/0x50
[Fri Mar 28 07:26:16 2025] ? srso_alias_return_thunk+0x5/0x7f
[Fri Mar 28 07:26:16 2025] ? do_syscall_64+0x63/0xb0
[Fri Mar 28 07:26:16 2025] ? exit_to_user_mode_prepare+0x49/0x100
[Fri Mar 28 07:26:16 2025] ? srso_alias_return_thunk+0x5/0x7f
[Fri Mar 28 07:26:16 2025] ? syscall_exit_to_user_mode+0x1e/0x50
[Fri Mar 28 07:26:16 2025] ? srso_alias_return_thunk+0x5/0x7f
[Fri Mar 28 07:26:16 2025] ? do_syscall_64+0x63/0xb0
[Fri Mar 28 07:26:16 2025] ? syscall_exit_to_user_mode+0x1e/0x50
[Fri Mar 28 07:26:16 2025] ? srso_alias_return_thunk+0x5/0x7f
[Fri Mar 28 07:26:16 2025] ? do_syscall_64+0x63/0xb0
[Fri Mar 28 07:26:16 2025] ? srso_alias_return_thunk+0x5/0x7f
[Fri Mar 28 07:26:16 2025] ? do_syscall_64+0x63/0xb0
[Fri Mar 28 07:26:16 2025] ? irqentry_exit+0x1d/0x30
[Fri Mar 28 07:26:16 2025] ? srso_alias_return_thunk+0x5/0x7f
[Fri Mar 28 07:26:16 2025] ? sysvec_hyperv_stimer0+0x4e/0x90
[Fri Mar 28 07:26:16 2025] entry_SYSCALL_64_after_hwframe+0x6c/0xd6
[Fri Mar 28 07:26:16 2025] RIP: 0033:0x47d46e
[Fri Mar 28 07:26:16 2025] RSP: 002b:000000c0005ff2c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[Fri Mar 28 07:26:16 2025] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 000000000047d46e
[Fri Mar 28 07:26:16 2025] RDX: 0000000000000005 RSI: 000000c0005acff5 RDI: 0000000000000006
[Fri Mar 28 07:26:16 2025] RBP: 000000c0005ff308 R08: 0000000000000000 R09: 0000000000000000
[Fri Mar 28 07:26:16 2025] R10: 0000000000000000 R11: 0000000000000202 R12: 000000c0005ff438
[Fri Mar 28 07:26:16 2025] R13: 0000000000000000 R14: 000000c000582540 R15: 00007fffffffffff
[Fri Mar 28 07:26:16 2025] </TASK>
[Fri Mar 28 07:26:16 2025] INFO: task kworker/u8:0:258917 blocked for more than 120 seconds.
[Fri Mar 28 07:26:16 2025] Not tainted 5.15.0-1081-azure #90-Ubuntu
[Fri Mar 28 07:26:16 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar 28 07:26:16 2025] task:kworker/u8:0 state:D stack: 0 pid:258917 ppid: 2 flags:0x00004000
[Fri Mar 28 07:26:16 2025] Workqueue: storvsc_error_wq_1 storvsc_remove_lun
[Fri Mar 28 07:26:16 2025] Call Trace:
[Fri Mar 28 07:26:16 2025] <TASK>
[Fri Mar 28 07:26:16 2025] __schedule+0x243/0x5f0
[Fri Mar 28 07:26:16 2025] ? srso_alias_return_thunk+0x5/0x7f
[Fri Mar 28 07:26:16 2025] ? update_load_avg+0x82/0x660
[Fri Mar 28 07:26:16 2025] schedule+0x69/0x110
[Fri Mar 28 07:26:16 2025] schedule_preempt_disabled+0xe/0x20
[Fri Mar 28 07:26:16 2025] __mutex_lock.constprop.0+0x267/0x490
[Fri Mar 28 07:26:16 2025] __mutex_lock_slowpath+0x13/0x20
[Fri Mar 28 07:26:16 2025] mutex_lock+0x38/0x50
[Fri Mar 28 07:26:16 2025] scsi_remove_device+0x1f/0x40
[Fri Mar 28 07:26:16 2025] storvsc_remove_lun+0x46/0x70
[Fri Mar 28 07:26:16 2025] process_one_work+0x225/0x3d0
[Fri Mar 28 07:26:16 2025] worker_thread+0x53/0x410
[Fri Mar 28 07:26:16 2025] ? process_one_work+0x3d0/0x3d0
[Fri Mar 28 07:26:16 2025] kthread+0x12a/0x150
[Fri Mar 28 07:26:16 2025] ? set_kthread_struct+0x50/0x50
[Fri Mar 28 07:26:16 2025] ret_from_fork+0x22/0x30
[Fri Mar 28 07:26:16 2025] </TASK>
Output of lsblk:
NAME KNAME MAJ:MIN FSTYPE SIZE TYPE MOUNTPOINT UUID PARTUUID LABEL MODEL SERIAL STATE OWNER GROUP MODE ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
sda sda 8:0 128G disk Virtual Disk 60022480c9f175130a23712edcd90c7a running root disk brw-rw---- 0 4096 0 4096 512 1 none 128 128 0B
├─sda1 sda1 8:1 ext4 127.9G part / 53814baa-d87c-4a19-905f-cb5736cff282 16272062-7328-4878-85b9-462d200f6af0 cloudimg-rootfs root disk brw-rw---- 0 4096 0 4096 512 1 none 128 128 0B
├─sda14 sda14 8:14 4M part a3ebc018-454e-4d0a-8785-fc44af01944a root disk brw-rw---- 0 4096 0 4096 512 1 none 128 128 0B
└─sda15 sda15 8:15 vfat 106M part /boot/efi 19BB-CF50 e0d89905-dfa1-4c34-8cdb-4ae496103fd2 UEFI root disk brw-rw---- 0 4096 0 4096 512 1 none 128 128 0B
sdb sdb 8:16 20G disk Virtual Disk 1941A6B0901CD149B6FDCF0747A94961 offline root disk brw-rw---- 0 4096 0 4096 4096 0 none 316 128 0B
sr0 sr0 11:0 736K rom Virtual DVD-ROM 4d534654202020207305e3437703544694957d7ced624a7d running root cdrom brw-rw---- 0 2048 0 2048 2048 1 none 316 128 0B
Output of lsscsi:
[0:0:0:0] disk Msft Virtual Disk 1.0 /dev/sda
dir: /sys/bus/scsi/devices/0:0:0:0 [/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/f8b3781a-1e82-4818-a1c3-63d806ec15bb/host0/target0:0:0/0:0:0:0]
[0:0:0:2] cd/dvd Msft Virtual DVD-ROM 1.0 /dev/sr0
dir: /sys/bus/scsi/devices/0:0:0:2 [/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/f8b3781a-1e82-4818-a1c3-63d806ec15bb/host0/target0:0:0/0:0:0:2]
[1:0:0:0] disk Msft Virtual Disk 1.0 /dev/sdb
dir: /sys/bus/scsi/devices/1:0:0:0 [/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host1/target1:0:0/1:0:0:0]
What you expected to happen:
The PVC-associated Disk should be properly attached to the new node, allowing the Pod to start successfully.
How to reproduce it:
We have not found a reliable reproduction, but repeatedly performing the same migration (cordon the node + delete the Pod) on the same Pod increases the likelihood of hitting this issue.
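For completeness, the repeated-migration loop used when trying to reproduce looks roughly like the sketch below (pod/namespace names from this report; the `kubectl wait` timeout is an assumption, and the function wrapper only allows overriding the kubectl binary for a dry run):

```shell
POD=kf-2989ir2rmxntrw8t-automq-enterprise-broker-0
NS=kf-2989ir2rmxntrw8t

# One migration round: cordon a node, delete the pod, wait for it to come
# back Ready on another node, then uncordon so the next round can move it back.
repro_round() {
  node=$1
  ${KUBECTL:-kubectl} cordon "$node"
  ${KUBECTL:-kubectl} delete pod "$POD" -n "$NS" --wait=true
  ${KUBECTL:-kubectl} wait pod "$POD" -n "$NS" --for=condition=Ready --timeout=10m
  ${KUBECTL:-kubectl} uncordon "$node"
}
```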
Anything else we need to know?:
Pod Declaration:
apiVersion: v1
kind: Pod
metadata:
annotations:
automq.com/spec-id: spec-kcejjsvt25uer3tg
prometheus.io/automq-path: /metrics
prometheus.io/automq-port: "9090"
prometheus.io/automq-scrape: "true"
creationTimestamp: "2025-03-28T08:06:27Z"
generateName: kf-2989ir2rmxntrw8t-automq-enterprise-broker-
labels:
app.kubernetes.io/component: broker
app.kubernetes.io/instance: kf-2989ir2rmxntrw8t-automq-enterprise
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: automq-enterprise
app.kubernetes.io/version: 1.4.0
apps.kubernetes.io/pod-index: "0"
automq.cloud/environment-id: env-kjojvupwxhswgvi1
automq.cloud/instance-id: kf-2989ir2rmxntrw8t
controller-revision-hash: kf-2989ir2rmxntrw8t-automq-enterprise-broker-6cbdc4bfbd
helm.sh/chart: automq-enterprise-1.4.0
statefulset.kubernetes.io/pod-name: kf-2989ir2rmxntrw8t-automq-enterprise-broker-0
name: kf-2989ir2rmxntrw8t-automq-enterprise-broker-0
namespace: kf-2989ir2rmxntrw8t
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: StatefulSet
name: kf-2989ir2rmxntrw8t-automq-enterprise-broker
uid: 38ce0d2b-f722-4db7-8b1e-ac6b6de1d4cd
resourceVersion: "4384800"
uid: 0a9fe09f-4bae-47cf-851e-c9d188f9d5e4
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.azure.com/agentpool
operator: In
values:
- automq
- key: node.kubernetes.io/instance-type
operator: In
values:
- Standard_D4as_v5
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app.kubernetes.io/instance
operator: In
values:
- kf-2989ir2rmxntrw8t-automq-enterprise
- key: app.kubernetes.io/component
operator: In
values:
- broker
topologyKey: kubernetes.io/hostname
containers:
- env:
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: POD_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
- name: CPU_LIMIT
valueFrom:
resourceFieldRef:
containerName: automq-broker
divisor: "0"
resource: limits.cpu
- name: MEM_LIMIT
valueFrom:
resourceFieldRef:
containerName: automq-broker
divisor: 1Mi
resource: limits.memory
- name: LOG_DIR
value: /data/logs
- name: KAFKA_JVM_PERFORMANCE_OPTS
value: -server -XX:+UseZGC -XX:ZCollectionInterval=5
- name: KAFKA_OPTS
value: -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError
- name: AUTOMQ_BASE_NODE_ID
value: "1000"
- name: KAFKA_HEAP_OPTS
value: -Xmx6g -Xms6g -XX:MaxDirectMemorySize=6g -XX:MetaspaceSize=96m
image: automq-docker-registry-registry.cn-hangzhou.cr.aliyuncs.com/automq/automq-enterprise:temp-ny-test-20250321085849
imagePullPolicy: Always
lifecycle:
preStop:
exec:
command:
- /opt/automq/automq.sh
- stop
livenessProbe:
failureThreshold: 4
initialDelaySeconds: 60
periodSeconds: 30
successThreshold: 1
tcpSocket:
port: 9102
timeoutSeconds: 1
name: automq-broker
ports:
- containerPort: 9102
protocol: TCP
- containerPort: 9092
protocol: TCP
readinessProbe:
failureThreshold: 8
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
tcpSocket:
port: 9102
timeoutSeconds: 1
resources:
limits:
cpu: "4"
memory: 16Gi
requests:
cpu: "3"
memory: 13Gi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeDevices:
- devicePath: /dev/wal
name: kf-2989ir2rmxntrw8t-automq-enterprise-broker-data
volumeMounts:
- mountPath: /data/conf
name: conf
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-zsfj4
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: kf-2989ir2rmxntrw8t-automq-enterprise-broker-0
nodeName: aks-automq-19985432-vmss00000p
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: automq-sa
serviceAccountName: automq-sa
subdomain: kf-2989ir2rmxntrw8t-automq-enterprise-broker-headless
terminationGracePeriodSeconds: 300
tolerations:
- effect: NoSchedule
key: dedicated
operator: Equal
value: automq
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
volumes:
- name: kf-2989ir2rmxntrw8t-automq-enterprise-broker-data
persistentVolumeClaim:
claimName: kf-2989ir2rmxntrw8t-automq-enterprise-broker-data-kf-2989ir2rmxntrw8t-automq-enterprise-broker-0
- configMap:
defaultMode: 420
name: kf-2989ir2rmxntrw8t-automq-enterprise-broker-conf
name: conf
- name: kube-api-access-zsfj4
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2025-03-28T08:06:27Z"
status: "False"
type: PodReadyToStartContainers
- lastProbeTime: null
lastTransitionTime: "2025-03-28T08:06:27Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2025-03-28T08:06:27Z"
message: 'containers with unready status: [automq-broker]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2025-03-28T08:06:27Z"
message: 'containers with unready status: [automq-broker]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2025-03-28T08:06:27Z"
status: "True"
type: PodScheduled
containerStatuses:
- image: automq-docker-registry-registry.cn-hangzhou.cr.aliyuncs.com/automq/automq-enterprise:temp-ny-test-20250321085849
imageID: ""
lastState: {}
name: automq-broker
ready: false
restartCount: 0
started: false
state:
waiting:
reason: ContainerCreating
hostIP: 10.224.1.14
hostIPs:
- ip: 10.224.1.14
phase: Pending
qosClass: Burstable
startTime: "2025-03-28T08:06:27Z"
PVC Declaration:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
annotations:
pv.kubernetes.io/bind-completed: "yes"
pv.kubernetes.io/bound-by-controller: "yes"
volume.beta.kubernetes.io/storage-provisioner: disk.csi.azure.com
volume.kubernetes.io/selected-node: aks-automq-19985432-vmss00000g
volume.kubernetes.io/storage-provisioner: disk.csi.azure.com
creationTimestamp: "2025-03-28T02:45:16Z"
finalizers:
- kubernetes.io/pvc-protection
labels:
app.kubernetes.io/component: broker
app.kubernetes.io/instance: kf-2989ir2rmxntrw8t-automq-enterprise
app.kubernetes.io/name: automq-enterprise
name: kf-2989ir2rmxntrw8t-automq-enterprise-broker-data-kf-2989ir2rmxntrw8t-automq-enterprise-broker-0
namespace: kf-2989ir2rmxntrw8t
resourceVersion: "4268524"
uid: 5be5c28c-a881-431d-b1ac-6d291d50da15
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
storageClassName: automq-disk-azure-premium-ssd
volumeMode: Block
volumeName: pvc-5be5c28c-a881-431d-b1ac-6d291d50da15
status:
accessModes:
- ReadWriteOnce
capacity:
storage: 20Gi
phase: Bound
PV Declaration:
apiVersion: v1
kind: PersistentVolume
metadata:
annotations:
pv.kubernetes.io/provisioned-by: disk.csi.azure.com
volume.kubernetes.io/provisioner-deletion-secret-name: ""
volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
creationTimestamp: "2025-03-28T02:45:22Z"
finalizers:
- external-provisioner.volume.kubernetes.io/finalizer
- kubernetes.io/pv-protection
- external-attacher/disk-csi-azure-com
name: pvc-5be5c28c-a881-431d-b1ac-6d291d50da15
resourceVersion: "4268532"
uid: fae5a84f-8125-4af3-a9a7-93aeaeb5229a
spec:
accessModes:
- ReadWriteOnce
capacity:
storage: 20Gi
claimRef:
apiVersion: v1
kind: PersistentVolumeClaim
name: kf-2989ir2rmxntrw8t-automq-enterprise-broker-data-kf-2989ir2rmxntrw8t-automq-enterprise-broker-0
namespace: kf-2989ir2rmxntrw8t
resourceVersion: "4268502"
uid: 5be5c28c-a881-431d-b1ac-6d291d50da15
csi:
driver: disk.csi.azure.com
volumeAttributes:
cachingmode: None
csi.storage.k8s.io/pv/name: pvc-5be5c28c-a881-431d-b1ac-6d291d50da15
csi.storage.k8s.io/pvc/name: kf-2989ir2rmxntrw8t-automq-enterprise-broker-data-kf-2989ir2rmxntrw8t-automq-enterprise-broker-0
csi.storage.k8s.io/pvc/namespace: kf-2989ir2rmxntrw8t
requestedsizegib: "20"
skuName: PremiumV2_LRS
storage.kubernetes.io/csiProvisionerIdentity: 1742980023812-2230-disk.csi.azure.com
volumeHandle: /subscriptions/218357d0-eaaf-4e3e-9ffa-6b4ccb7e2df9/resourceGroups/MC_AutoMQ_qiuzhang-k8s-1_koreacentral/providers/Microsoft.Compute/disks/pvc-5be5c28c-a881-431d-b1ac-6d291d50da15
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: topology.disk.csi.azure.com/zone
operator: In
values:
- koreacentral-1
persistentVolumeReclaimPolicy: Delete
storageClassName: automq-disk-azure-premium-ssd
volumeMode: Block
status:
lastPhaseTransitionTime: "2025-03-28T02:45:22Z"
phase: Bound
Environment:
- CSI Driver version: v1.30.9
- Kubernetes version (use kubectl version): Client Version v1.26.3, Kustomize Version v4.5.7, Server Version v1.30.9
- OS (e.g. from /etc/os-release): Ubuntu 22.04.5 LTS
- Kernel (e.g. uname -a): Linux aks-automq-19985432-vmss00000P 5.15.0-1081-azure #90-Ubuntu SMP Tue Jan 28 05:15:28 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: apt (I'm not sure what "Install tools" means)
- Others: