
Eureka synchronization fails after a network problem #10

@yugj

Description


Eureka synchronization fails after a network problem

Symptom

During testing, we found that while Nacos was being synchronized to Eureka, one node lost contact. The Eureka Server log is as follows:

2022-06-09 10:49:30.830  WARN 41149 --- [cosSynchronizer] c.n.e.registry.AbstractInstanceRegistry  : DS: Registry: lease doesn't exist, registering resource: NACOS-SERVICE-DEMO - nacos-service-demo:192.168.8.13:8889
2022-06-09 10:49:40.842  WARN 41149 --- [cosSynchronizer] c.n.e.registry.AbstractInstanceRegistry  : DS: Registry: lease doesn't exist, registering resource: NACOS-SERVICE-DEMO - nacos-service-demo:192.168.8.13:8889
2022-06-09 10:49:50.860  WARN 41149 --- [cosSynchronizer] c.n.e.registry.AbstractInstanceRegistry  : DS: Registry: lease doesn't exist, registering resource: NACOS-SERVICE-DEMO - nacos-service-demo:192.168.8.13:8889
2022-06-09 10:50:00.877  WARN 41149 --- [cosSynchronizer] c.n.e.registry.AbstractInstanceRegistry  : DS: Registry: lease doesn't exist, registering resource: NACOS-SERVICE-DEMO - nacos-service-demo:192.168.8.13:8889
2022-06-09 10:50:10.893  WARN 41149 --- [cosSynchronizer] c.n.e.registry.AbstractInstanceRegistry  : DS: Registry: lease doesn't exist, registering resource: NACOS-SERVICE-DEMO - nacos-service-demo:192.168.8.13:8889
2022-06-09 10:50:20.904  WARN 41149 --- [cosSynchronizer] c.n.e.registry.AbstractInstanceRegistry  : DS: Registry: lease doesn't exist, registering resource: NACOS-SERVICE-DEMO - nacos-service-demo:192.168.8.13:8889

Code trace

NacosSynchronizer

public void syncService() throws Exception {
    ListView<String> serviceList = namingService.getServicesOfServer(1, 1000);

    for (String service : serviceList.getData()) {
        List<Instance> instances = namingService.getAllInstances(service);
        for (Instance instance : instances) {
            if (!isFromEureka(instance)) {
                String instanceId = String.format("%s:%s:%s", service, instance.getIp(), instance.getPort());
                peerAwareInstanceRegistry.renew(service.toUpperCase(), instanceId, false);
            }
        }

        List<ServiceInfo> list = namingService.getSubscribeServices();
        Optional<ServiceInfo> optional = list.stream().filter(serviceInfo -> serviceInfo.getName().equals(service)).findFirst();
        if (!optional.isPresent()) {
            namingService.subscribe(service, listener);
        }
    }
}

The code renews each Nacos instance in Eureka and subscribes to Nacos so that state changes are synchronized to Eureka.

The call peerAwareInstanceRegistry.renew(service.toUpperCase(), instanceId, false); is handled by the following logic:

public boolean renew(String appName, String id, boolean isReplication) {
    RENEW.increment(isReplication);
    Map<String, Lease<InstanceInfo>> gMap = registry.get(appName);
    Lease<InstanceInfo> leaseToRenew = null;
    if (gMap != null) {
        leaseToRenew = gMap.get(id);
    }
    if (leaseToRenew == null) {
        RENEW_NOT_FOUND.increment(isReplication);
        logger.warn("DS: Registry: lease doesn't exist, registering resource: {} - {}", appName, id);
        return false;
    } else {
        // ... the lease exists: renew it (details omitted) ...
    }
}

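When the instance does not exist in Eureka (leaseToRenew == null), renew neither renews nor registers anything: despite the "registering resource" wording of the log message, it only logs a warning and returns false. If a network problem cuts off the Eureka Server from the Nacos node, Eureka evicts the Nacos instance. After the network recovers, the Eureka Server's scheduled task fetches the instances and calls renew, but because the lease no longer exists, nothing is renewed. Meanwhile the Nacos cluster state has not changed, so NacosEventListener is never triggered to register the instance again.
As a result, the instance is never synchronized back to Eureka, even after the network has recovered.
The temporary workaround is to restart the Nacos node so that a NacosEventListener event fires again.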
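To make the failure mode concrete, here is a minimal, self-contained Java sketch. It is a toy model only: RenewOnlySyncDemo, its in-memory registry map, and its renew/register/evict/syncService methods are hypothetical stand-ins that mirror the behavior described above, not the real Eureka or Nacos classes. It registers the instance, evicts it as if the network had failed, then runs the renew-only sync loop a few times.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: names mirror Eureka/Nacos concepts but are NOT their real APIs.
public class RenewOnlySyncDemo {

    // appName -> instance ids that currently hold a lease (stand-in for Eureka's registry map)
    private final Map<String, Set<String>> registry = new ConcurrentHashMap<>();

    // Same contract as the renew shown above: a missing lease is only logged, and false is returned.
    boolean renew(String appName, String id) {
        Set<String> leases = registry.get(appName);
        if (leases == null || !leases.contains(id)) {
            System.out.println("DS: Registry: lease doesn't exist, registering resource: " + appName + " - " + id);
            return false;
        }
        return true; // the real method would also refresh the lease here
    }

    void register(String appName, String id) {
        registry.computeIfAbsent(appName, k -> ConcurrentHashMap.newKeySet()).add(id);
    }

    // Simulates Eureka dropping the instance while the network is down.
    void evict(String appName, String id) {
        Set<String> leases = registry.get(appName);
        if (leases != null) {
            leases.remove(id);
        }
    }

    boolean isRegistered(String appName, String id) {
        Set<String> leases = registry.get(appName);
        return leases != null && leases.contains(id);
    }

    // Renew-only loop, like the current syncService: nothing here can re-create a lease.
    void syncService(String service, List<String> nacosInstanceIds) {
        for (String id : nacosInstanceIds) {
            renew(service.toUpperCase(), id);
        }
    }

    // Returns whether the instance is back in the registry after the network recovers.
    static boolean simulateOutageAndRecovery() {
        RenewOnlySyncDemo eureka = new RenewOnlySyncDemo();
        String service = "nacos-service-demo";
        String id = "nacos-service-demo:192.168.8.13:8889";
        eureka.register(service.toUpperCase(), id); // initial registration (via the Nacos event listener)
        eureka.evict(service.toUpperCase(), id);    // network problem: Eureka drops the instance
        for (int tick = 0; tick < 3; tick++) {      // network is back, but no Nacos event fires
            eureka.syncService(service, List.of(id));
        }
        return eureka.isRegistered(service.toUpperCase(), id);
    }

    public static void main(String[] args) {
        System.out.println("registered after recovery: " + simulateOutageAndRecovery());
    }
}
```

Running main prints the "lease doesn't exist" warning once per sync tick, just like the repeated lines in the Eureka Server log, and then registered after recovery: false.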

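Should syncService have a fallback here, for example something like the following, or check whether the instance exists and then either renew or register it?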

if (!isFromEureka(instance)) {
    String instanceId = String.format("%s:%s:%s", service, instance.getIp(), instance.getPort());
    boolean renewSuccess = peerAwareInstanceRegistry.renew(service.toUpperCase(), instanceId, false);
    if (!renewSuccess) {
        peerAwareInstanceRegistry.register(xxx);
    }
}
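The proposed renew-or-register fallback can be sketched with the same kind of toy model. Again, RenewOrRegisterSyncDemo and all of its methods are hypothetical; in the real NacosSynchronizer, register(...) would need an InstanceInfo built from the Nacos Instance (the xxx placeholder above), which this sketch replaces with a plain id string.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: a string id stands in for the InstanceInfo that the real register(...) needs.
public class RenewOrRegisterSyncDemo {

    // appName -> instance ids that currently hold a lease
    private final Map<String, Set<String>> registry = new ConcurrentHashMap<>();
    private int registerCalls = 0;

    // Returns false when the lease is missing, like the renew shown above.
    boolean renew(String appName, String id) {
        Set<String> leases = registry.get(appName);
        return leases != null && leases.contains(id);
    }

    void register(String appName, String id) {
        registerCalls++;
        registry.computeIfAbsent(appName, k -> ConcurrentHashMap.newKeySet()).add(id);
    }

    // Simulates Eureka dropping the instance while the network is down.
    void evict(String appName, String id) {
        Set<String> leases = registry.get(appName);
        if (leases != null) {
            leases.remove(id);
        }
    }

    boolean isRegistered(String appName, String id) {
        Set<String> leases = registry.get(appName);
        return leases != null && leases.contains(id);
    }

    int registerCalls() {
        return registerCalls;
    }

    // Proposed fallback: renew first; only if the lease is gone, register the instance again.
    void syncService(String service, List<String> nacosInstanceIds) {
        String appName = service.toUpperCase();
        for (String id : nacosInstanceIds) {
            if (!renew(appName, id)) {
                register(appName, id);
            }
        }
    }

    static RenewOrRegisterSyncDemo simulateOutageAndRecovery() {
        RenewOrRegisterSyncDemo eureka = new RenewOrRegisterSyncDemo();
        String service = "nacos-service-demo";
        String id = "nacos-service-demo:192.168.8.13:8889";
        eureka.register(service.toUpperCase(), id); // initial registration
        eureka.evict(service.toUpperCase(), id);    // instance dropped during the outage
        for (int tick = 0; tick < 3; tick++) {      // scheduled sync after the network recovers
            eureka.syncService(service, List.of(id));
        }
        return eureka;
    }

    public static void main(String[] args) {
        RenewOrRegisterSyncDemo eureka = simulateOutageAndRecovery();
        System.out.println("registered after recovery: "
                + eureka.isRegistered("NACOS-SERVICE-DEMO", "nacos-service-demo:192.168.8.13:8889"));
        System.out.println("register calls: " + eureka.registerCalls());
    }
}
```

Calling renew first and registering only when it returns false leaves the normal path unchanged and reuses renew's own result as the existence check, instead of adding a separate lookup before every renew. In the simulation, the first sync tick after recovery re-registers the instance and later ticks are plain renewals, so main prints registered after recovery: true and register calls: 2 (the initial registration plus one fallback).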

Metadata


Assignees

No one assigned

Labels

No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
