Connection Issue with Ray Cluster After Upgrading to Version 2.8.0 #41643

@sip-aravind-g

Description

What happened + What you expected to happen

I recently upgraded our Ray cluster from version 2.6.1 to version 2.8.0. The cluster runs under the KubeRay operator v0.6.0. Since the upgrade, every attempt to connect to the cluster with the Ray Python client (also version 2.8.0) fails with the error below, so no jobs can run.

With Ray 2.6.1, the same setup worked without problems. Could you help me understand what changed, or how to resolve this connection failure on Ray 2.8.0?

ERROR:

SIGTERM handler is not set because current thread is not the main thread.
2023-11-24 11:52:56,517 WARNING dataclient.py:403 -- Encountered connection issues in the data channel. Attempting to reconnect.
2023-11-24 11:53:26,729 WARNING dataclient.py:410 -- Failed to reconnect the data channel

ConnectionError Traceback (most recent call last)
Cell In[8], line 2
1 import ray
----> 2 ray.init(address="ray://kuberay-head-svc.kuberay:10001")
3 ray.shutdown()

File /opt/conda/envs/ray/lib/python3.8/site-packages/ray/_private/client_mode_hook.py:103, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
101 if func.__name__ != "init" or is_client_mode_enabled_by_default:
102 return getattr(ray, func.__name__)(*args, **kwargs)
--> 103 return func(*args, **kwargs)

File /opt/conda/envs/ray/lib/python3.8/site-packages/ray/_private/worker.py:1379, in init(address, num_cpus, num_gpus, resources, labels, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, **kwargs)
1377 passed_kwargs.update(kwargs)
1378 builder._init_args(**passed_kwargs)
-> 1379 ctx = builder.connect()
1380 from ray._private.usage import usage_lib
1382 if passed_kwargs.get("allow_multiple") is True:

File /opt/conda/envs/ray/lib/python3.8/site-packages/ray/client_builder.py:173, in ClientBuilder.connect(self)
170 if self._allow_multiple_connections:
171 old_ray_cxt = ray.util.client.ray.set_context(None)
--> 173 client_info_dict = ray.util.client_connect.connect(
174 self.address,
175 job_config=self._job_config,
176 _credentials=self._credentials,
177 ray_init_kwargs=self._remote_init_kwargs,
178 metadata=self._metadata,
179 )
181 dashboard_url = ray.util.client.ray._get_dashboard_url()
183 cxt = ClientContext(
184 dashboard_url=dashboard_url,
185 python_version=client_info_dict["python_version"],
(...)
190 _context_to_restore=ray.util.client.ray.get_context(),
191 )

File /opt/conda/envs/ray/lib/python3.8/site-packages/ray/util/client_connect.py:55, in connect(conn_str, secure, metadata, connection_retries, job_config, namespace, ignore_version, _credentials, ray_init_kwargs)
50 _explicitly_enable_client_mode()
52 # TODO(barakmich): #13274
53 # for supporting things like cert_path, ca_path, etc and creating
54 # the correct metadata
---> 55 conn = ray.connect(
56 conn_str,
57 job_config=job_config,
58 secure=secure,
59 metadata=metadata,
60 connection_retries=connection_retries,
61 namespace=namespace,
62 ignore_version=ignore_version,
63 _credentials=_credentials,
64 ray_init_kwargs=ray_init_kwargs,
65 )
66 return conn

File /opt/conda/envs/ray/lib/python3.8/site-packages/ray/util/client/__init__.py:250, in RayAPIStub.connect(self, *args, **kw_args)
248 def connect(self, *args, **kw_args):
249 self.get_context()._inside_client_test = self._inside_client_test
--> 250 conn = self.get_context().connect(*args, **kw_args)
251 global _lock, _all_contexts
252 with _lock:

File /opt/conda/envs/ray/lib/python3.8/site-packages/ray/util/client/__init__.py:100, in _ClientContext.connect(self, conn_str, job_config, secure, metadata, connection_retries, namespace, ignore_version, _credentials, ray_init_kwargs)
92 self.client_worker = Worker(
93 conn_str,
94 secure=secure,
(...)
97 connection_retries=connection_retries,
98 )
99 self.api.worker = self.client_worker
--> 100 self.client_worker._server_init(job_config, ray_init_kwargs)
101 conn_info = self.client_worker.connection_info()
102 self._check_versions(conn_info, ignore_version)

File /opt/conda/envs/ray/lib/python3.8/site-packages/ray/util/client/worker.py:847, in Worker._server_init(self, job_config, ray_init_kwargs)
843 job_config.set_runtime_env(runtime_env, validate=True)
845 serialized_job_config = pickle.dumps(job_config)
--> 847 response = self.data_client.Init(
848 ray_client_pb2.InitRequest(
849 job_config=serialized_job_config,
850 ray_init_kwargs=json.dumps(ray_init_kwargs),
851 reconnect_grace_period=self._reconnect_grace_period,
852 )
853 )
854 if not response.ok:
855 raise ConnectionAbortedError(
856 f"Initialization failure from server:\n{response.msg}"
857 )

File /opt/conda/envs/ray/lib/python3.8/site-packages/ray/util/client/dataclient.py:519, in DataClient.Init(self, request, context)
513 def Init(
514 self, request: ray_client_pb2.InitRequest, context=None
515 ) -> ray_client_pb2.InitResponse:
516 datareq = ray_client_pb2.DataRequest(
517 init=request,
518 )
--> 519 resp = self._blocking_send(datareq)
520 return resp.init

File /opt/conda/envs/ray/lib/python3.8/site-packages/ray/util/client/dataclient.py:458, in DataClient._blocking_send(self, req)
455 self.outstanding_requests[req_id] = req
457 self.cv.wait_for(lambda: req_id in self.ready_data or self._in_shutdown)
--> 458 self._check_shutdown()
460 data = self.ready_data[req_id]
461 del self.ready_data[req_id]

File /opt/conda/envs/ray/lib/python3.8/site-packages/ray/util/client/dataclient.py:511, in DataClient._check_shutdown(self)
505 else:
506 msg = (
507 "Request can't be sent because the Ray client has already "
508 "been disconnected."
509 )
--> 511 raise ConnectionError(msg)

ConnectionError: Request can't be sent because the Ray client has already been disconnected due to an error. Last exception: Failed to reconnect within the reconnection grace period (30s)

Versions / Dependencies

Docker images used:

RAY:
rayproject/ray-ml:2.8.0-py38-gpu
rayproject/ray-ml:2.8.0-py38-cpu
rayproject/ray:2.8.0-py38-gpu
rayproject/ray:2.8.0-py38-cpu

Operator:
kuberay/operator:v0.6.0
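
The traceback above shows the client calling `_check_versions` once it connects, so one thing worth ruling out is a mismatch between the image tags listed here and the client environment. Below is a minimal sketch that splits a tag such as `rayproject/ray-ml:2.8.0-py38-gpu` into its Ray and Python parts and compares them with the client side. The helper names (`parse_image_tag`, `local_python_tag`, `mismatches`) are hypothetical, and the tag layout is assumed from the four image names above, not taken from any Ray API.

```python
import re
import sys


def parse_image_tag(image: str):
    """Return (ray_version, python_tag) from a tag like '...:2.8.0-py38-gpu'.

    Assumption: the tag follows the '<ray version>-py<digits>-...' layout
    seen in the image names listed in this issue.
    """
    tag = image.rsplit(":", 1)[1]
    match = re.match(r"(?P<ray>\d+\.\d+\.\d+)-py(?P<py>\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognised tag layout: {tag!r}")
    return match.group("ray"), match.group("py")


def local_python_tag() -> str:
    """Python version of the current interpreter in the same 'py38' style."""
    return f"{sys.version_info.major}{sys.version_info.minor}"


def mismatches(image: str, client_ray_version: str, client_python_tag: str):
    """List human-readable differences between the image and the client."""
    image_ray, image_py = parse_image_tag(image)
    problems = []
    if image_ray != client_ray_version:
        problems.append(f"Ray: image {image_ray} vs client {client_ray_version}")
    if image_py != client_python_tag:
        problems.append(f"Python: image py{image_py} vs client py{client_python_tag}")
    return problems
```

On the client machine this could be called as `mismatches("rayproject/ray:2.8.0-py38-cpu", ray.__version__, local_python_tag())`. An empty list only means the tags agree; it does not show that the running pods actually use that image.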

Reproduction script

import ray
ray.init(address="ray://kuberay-head-svc.kuberay:10001")
ray.shutdown()
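
Since the failure is reported on the data channel, it may help to first confirm that the address in the script is reachable at the TCP level from the client machine. The following is a rough sketch using only the standard library; `split_ray_address` and `can_open_tcp` are hypothetical helper names, not part of Ray. A successful TCP connect does not prove that the Ray Client server behind the port is healthy. It only rules out DNS and basic network problems.

```python
import socket
from urllib.parse import urlsplit


def split_ray_address(address: str):
    """Split 'ray://host:port' into (host, port); raise if it is malformed."""
    parts = urlsplit(address)
    if parts.scheme != "ray" or parts.hostname is None or parts.port is None:
        raise ValueError(f"expected ray://<host>:<port>, got {address!r}")
    return parts.hostname, parts.port


def can_open_tcp(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    host, port = split_ray_address("ray://kuberay-head-svc.kuberay:10001")
    print(f"{host}:{port} reachable: {can_open_tcp(host, port)}")
```

If this reports the port as unreachable, the problem is likely in the Service, DNS, or network policy and not in the Ray client itself.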

Issue Severity

High: It blocks me from completing my task.

Labels

@external-author-action-required: Alternate tag for PRs where the author doesn't have labeling permission.
P2: Important issue, but not time-critical
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
core-client: ray client related issues
usability
