Commit 9a1f789

wchargin authored and stephanwlee committed
data: check liveness before blessing data server (#4851)
Summary:
Issue #4844 shows that there are circumstances when communication between TensorBoard and the local data server process cannot be established. As of this patch, we send a trivial RPC to the data server and wait for its response before committing to use the new loading paths. This adds about 1–5ms total time to the happy path on my machine, with a worst case penalty of 5s due to the timeout. If the server is not reachable, we print a warning and fall back to the legacy paths.

Test Plan:

  - Test with normal working TensorBoard.
  - Simulate failed data server connection by changing the definition of `addr` to `"localhost:%d" % (port + 777)` on line 175. This should print the `UNAVAILABLE`/“failed to connect” message from #4844.
  - Simulate slow data server by adding the following to `cli.rs`:

    ```rust
    tokio::time::sleep(Duration::from_secs(3)).await;
    ```

    Add this right before the `Server::builder().(...).await?` call at the end of `main`: i.e., after we write the port file, but before we actually respond to requests. Note that TensorBoard still works with a 3-second delay, but that it actually delays printing the startup message for those 3 seconds as it determines which data provider to use.
  - Simulate extra-slow data provider as above but waiting 6 seconds, and note that TensorBoard prints a `DEADLINE_EXCEEDED` error, falls back to the legacy paths, and shows valid data.

wchargin-branch: data-liveness-check
wchargin-source: 7fd916ebcf426d92babaeb57c38789d47446df6e
1 parent 4273e36 commit 9a1f789
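
The probe-then-fall-back flow described in the summary can be sketched roughly as follows. This is a minimal illustration and not the patch itself: `FakeRpcError` and `FakeStub` are hypothetical stand-ins for `grpc.RpcError` and the real data provider stub so that the example runs without gRPC installed, and `choose_loading_path` is an invented name for the caller that decides between the new and legacy paths.

```python
import logging

logger = logging.getLogger(__name__)


class DataServerStartupError(RuntimeError):
    """Raised when the data server cannot be reached (name taken from the diff)."""


class FakeRpcError(Exception):
    """Hypothetical stand-in for grpc.RpcError."""


class FakeStub:
    """Hypothetical stub; `healthy` controls whether the probe RPC succeeds."""

    def __init__(self, healthy):
        self._healthy = healthy

    def GetExperiment(self, request, timeout=None):
        if not self._healthy:
            raise FakeRpcError("UNAVAILABLE: failed to connect")
        return {}


def check_liveness(stub, addr, timeout=5):
    """Send one trivial RPC and raise if the server does not answer."""
    try:
        stub.GetExperiment(None, timeout=timeout)
    except FakeRpcError as e:
        msg = "Failed to communicate with data server at %s: %s" % (addr, e)
        raise DataServerStartupError(msg) from e


def choose_loading_path(stub, addr):
    """Return "data_server" if the probe succeeds, else warn and use "legacy"."""
    try:
        check_liveness(stub, addr)
    except DataServerStartupError as e:
        logger.warning("%s; falling back to legacy paths", e)
        return "legacy"
    return "data_server"
```

The point of the design is that the provider is only "blessed" after a round trip has actually succeeded, so an unreachable server costs at most one timeout before the fallback takes over.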

2 files changed: 18 additions and 6 deletions


tensorboard/data/BUILD

Lines changed: 1 addition & 0 deletions
@@ -95,6 +95,7 @@ py_library(
         ":ingester",
         "//tensorboard:expect_grpc_installed",
         "//tensorboard:expect_pkg_resources_installed",
+        "//tensorboard/data/proto:protos_all_py_pb2",
         "//tensorboard/util:tb_logging",
     ],
 )

tensorboard/data/server_ingester.py

Lines changed: 17 additions & 6 deletions
@@ -26,6 +26,7 @@

 from tensorboard.data import grpc_provider
 from tensorboard.data import ingester
+from tensorboard.data.proto import data_provider_pb2
 from tensorboard.util import tb_logging


@@ -47,7 +48,8 @@ def __init__(self, address, *, channel_creds_type):
           channel_creds_type: `grpc_util.ChannelCredsType`, as passed to
             `--grpc_creds_type`.
         """
-        self._data_provider = _make_provider(address, channel_creds_type)
+        stub = _make_stub(address, channel_creds_type)
+        self._data_provider = grpc_provider.GrpcDataProvider(address, stub)

     @property
     def data_provider(self):
@@ -170,13 +172,23 @@ def start(self):
             )

         addr = "localhost:%d" % port
-        self._data_provider = _make_provider(addr, self._channel_creds_type)
+        stub = _make_stub(addr, self._channel_creds_type)
         logger.info(
-            "Established connection to data server at pid %d via %s",
+            "Opened channel to data server at pid %d via %s",
             popen.pid,
             addr,
         )

+        req = data_provider_pb2.GetExperimentRequest()
+        try:
+            stub.GetExperiment(req, timeout=5)  # should be near-instant
+        except grpc.RpcError as e:
+            msg = "Failed to communicate with data server at %s: %s" % (addr, e)
+            logging.warning("%s", msg)
+            raise DataServerStartupError(msg) from e
+        logger.info("Got valid response from data server")
+        self._data_provider = grpc_provider.GrpcDataProvider(addr, stub)
+


 def _maybe_read_file(path):
     """Read a file, or return `None` on ENOENT specifically."""
@@ -189,12 +201,11 @@ def _maybe_read_file(path):
         raise


-def _make_provider(addr, channel_creds_type):
+def _make_stub(addr, channel_creds_type):
     (creds, options) = channel_creds_type.channel_config()
     options.append(("grpc.max_receive_message_length", 1024 * 1024 * 256))
     channel = grpc.secure_channel(addr, creds, options=options)
-    stub = grpc_provider.make_stub(channel)
-    return grpc_provider.GrpcDataProvider(addr, stub)
+    return grpc_provider.make_stub(channel)


 class NoDataServerError(RuntimeError):
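
The `timeout=5` on the probe RPC in `start` is what separates the 3-second and 6-second cases in the test plan: a response that arrives within the deadline is accepted, and a later one surfaces as `DEADLINE_EXCEEDED`. The sketch below imitates that behavior with the standard library only. It is an assumption-laden illustration, not how gRPC implements deadlines: `call_with_deadline` and `make_slow_responder` are hypothetical helpers, and the delays are scaled down from seconds to fractions of a second so the example runs quickly.

```python
import concurrent.futures
import time


def make_slow_responder(delay_secs):
    """Return a callable that answers only after `delay_secs` (a fake server)."""

    def respond():
        time.sleep(delay_secs)
        return "response"

    return respond


def call_with_deadline(fn, deadline_secs):
    """Run `fn` in a worker thread and give up once the deadline has passed.

    Returns ("OK", value) if `fn` finishes in time, otherwise
    ("DEADLINE_EXCEEDED", None). The worker thread is not cancelled; the
    caller simply stops waiting for it.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return ("OK", future.result(timeout=deadline_secs))
    except concurrent.futures.TimeoutError:
        return ("DEADLINE_EXCEEDED", None)
    finally:
        # Do not block on the still-running worker.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    # Scaled-down analogue of the test plan: deadline 0.5 (for 5s),
    # delays of 0.3 (for 3s) and 0.6 (for 6s).
    print(call_with_deadline(make_slow_responder(0.3), 0.5))
    print(call_with_deadline(make_slow_responder(0.6), 0.5))
```

As in the patch, the caller blocks for the full delay in the slow-but-successful case, which matches the note that the startup message is held back while TensorBoard decides which data provider to use.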
