data: check liveness before blessing data server (#4851)
Summary:
Issue #4844 shows that there are circumstances when communication
between TensorBoard and the local data server process cannot be
established. As of this patch, we send a trivial RPC to the data server
and wait for its response before committing to use the new loading
paths. This adds about 1–5 ms to the happy path on my machine, with a
worst-case penalty of 5 s when the timeout expires. If the server is not
reachable, we print a warning and fall back to the legacy paths.
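The check-then-fall-back flow described above can be sketched roughly as follows. This is an illustration only, not the code from this patch: `check_liveness`, `select_loading_path`, and the `ping` callable are hypothetical names, and a worker thread with a timeout stands in for whatever deadline mechanism the real RPC layer provides.

```python
import concurrent.futures
import sys
import time


def check_liveness(ping, timeout_secs=5.0):
    """Run `ping` (a hypothetical stand-in for a trivial RPC) with a deadline.

    Returns (True, None) if `ping` returns in time, else (False, reason).
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(ping)
    try:
        future.result(timeout=timeout_secs)
        return True, None
    except concurrent.futures.TimeoutError:
        # The server did not answer before the deadline.
        return False, "DEADLINE_EXCEEDED"
    except Exception as e:
        # Any other failure (e.g. connection refused) counts as unreachable.
        return False, "UNAVAILABLE: %s" % e
    finally:
        # Do not block on a ping that is still running.
        executor.shutdown(wait=False)


def select_loading_path(ping, timeout_secs=5.0):
    """Commit to the data server only if it answers; otherwise fall back."""
    ok, error = check_liveness(ping, timeout_secs)
    if ok:
        return "data_server"
    print(
        "warning: data server unreachable (%s); using legacy paths" % error,
        file=sys.stderr,
    )
    return "legacy"


if __name__ == "__main__":
    # Deadlines are scaled down so the demo runs quickly.
    print(select_loading_path(lambda: None, timeout_secs=0.2))
    print(select_loading_path(lambda: time.sleep(0.5), timeout_secs=0.2))
```

Under these assumptions, a server that responds within the deadline only delays startup by the ping's round trip, while an unreachable or too-slow server costs at most the timeout before the legacy paths take over.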
Test Plan:
- Test with normal working TensorBoard.
- Simulate failed data server connection by changing the definition of
`addr` to `"localhost:%d" % (port + 777)` on line 175. This should
print the `UNAVAILABLE`/“failed to connect” message from #4844.
- Simulate slow data server by adding the following to `cli.rs`:
```rust
tokio::time::sleep(Duration::from_secs(3)).await;
```
Add this right before the `Server::builder().(...).await?` call at
the end of `main`: i.e., after we write the port file, but before we
actually respond to requests. Note that TensorBoard still works with
a 3-second delay, but that it actually delays printing the startup
message for those 3 seconds as it determines which data provider to
use.
- Simulate extra-slow data server as above, but waiting 6 seconds,
and note that TensorBoard prints a `DEADLINE_EXCEEDED` error, falls
back to the legacy paths, and shows valid data.
wchargin-branch: data-liveness-check
wchargin-source: 7fd916ebcf426d92babaeb57c38789d47446df6e