feat: Stream DGXC logs #377
Hey @roclark, Hemil mentioned that you built the initial DGXC executor. Before diving too deep into the work, I wanted to get your thoughts on this PR. Do you agree with the implementation design to fetch container logs?
Hey @ko3n1g, I haven't worked on the Run:ai pieces in a while, but I think this makes sense. My only question is whether the API server is always exposed/available by default to users on installations, or whether an admin needs to configure it beforehand.
Signed-off-by: oliver könig <[email protected]>
f2b1046 to 0c6b7c9
Signed-off-by: oliver könig <[email protected]>
nemo_run/core/execution/dgxcloud.py
from enum import Enum
from pathlib import Path
from typing import Any, Optional, Type
from typing import Any, Dict, Iterable, Optional
Can we remove Dict and just use dict for the types?
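For illustration, a minimal sketch of what the suggestion amounts to, assuming Python 3.9 or newer where the built-in `dict` can be subscripted directly. The function name `collect_labels` and its parameters are hypothetical and do not come from this PR.

```python
from typing import Any, Iterable, Optional


def collect_labels(
    pairs: Iterable[tuple[str, Any]],
    default: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Hypothetical helper annotated with built-in dict instead of typing.Dict."""
    # Copy the defaults so the caller's mapping is never mutated.
    labels: dict[str, Any] = dict(default or {})
    for key, value in pairs:
        labels[key] = value
    return labels
```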
role_name,
replica_id,
regex,
None,
Why are these needed?
torchx has 8 args, so before this change we were routing should_tail and streams into since and until.
My code paths use streams, so I ran into this issue.
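The misrouting described above can be reproduced with a small stand-in. This is only a sketch: `log_iter` below is a hypothetical function written for this example, and its parameter order is an assumption modeled on the names mentioned in this thread, not a copy of torchx's code.

```python
from datetime import datetime
from typing import Optional


def log_iter(
    app_id: str,
    role_name: str,
    k: int = 0,
    regex: Optional[str] = None,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
    should_tail: bool = False,
    streams: Optional[str] = None,
) -> dict:
    """Hypothetical stand-in with eight parameters; reports what it received."""
    return {
        "since": since,
        "until": until,
        "should_tail": should_tail,
        "streams": streams,
    }


# Only six positional arguments: the last two land in `since` and `until`.
misrouted = log_iter("app", "role", 0, None, True, "stderr")

# Passing None for `since` and `until` keeps the later arguments aligned.
aligned = log_iter("app", "role", 0, None, None, None, True, "stderr")
```

Passing the trailing arguments by keyword would avoid the problem as well, at the cost of diverging from a purely positional call.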
Signed-off-by: oliver könig <[email protected]>
This adds a log streamer to the DGXCExecutor and ties it to the frontend via the torchx scheduler.
Since the DGXC endpoint doesn't expose logs, we need to go through the kube-apiserver to download the pod's container logs. Since that requires a token, I decided to go via the torchx scheduler, which allows instantiating the executor's state.
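As a rough sketch of the idea (not the code in this PR), reading pod logs from the kube-apiserver comes down to an authenticated GET on the pod's `log` subresource. The helper names `build_log_request` and `stream_pod_logs`, the example host, and the assumption that the API server is reachable with a bearer token are all illustrative.

```python
import urllib.request
from typing import Callable, Iterator, Optional
from urllib.parse import quote, urlencode


def build_log_request(
    api_server: str,
    namespace: str,
    pod: str,
    token: str,
    container: Optional[str] = None,
    follow: bool = True,
) -> urllib.request.Request:
    """Build a GET request for the Kubernetes pod log subresource (hypothetical helper)."""
    params = {"follow": "true" if follow else "false"}
    if container:
        params["container"] = container
    url = (
        f"{api_server.rstrip('/')}/api/v1/namespaces/{quote(namespace)}"
        f"/pods/{quote(pod)}/log?{urlencode(params)}"
    )
    # The bearer token is assumed to come from the executor's state.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def stream_pod_logs(
    request: urllib.request.Request,
    opener: Callable = urllib.request.urlopen,
) -> Iterator[str]:
    """Yield decoded log lines as the server sends them."""
    with opener(request) as response:
        for raw in response:
            yield raw.decode("utf-8", errors="replace").rstrip("\n")
```

A caller would do something like `for line in stream_pod_logs(build_log_request(...)): print(line)`. Whether a given installation exposes the API server to users this way is the open question raised earlier in this thread.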
I still need to: