ko3n1g (Contributor) commented Nov 5, 2025

This adds a log-streamer to the DGXCExecutor and ties it to the frontend via the torchx scheduler.

Since the DGXC endpoint doesn't expose logs, we need to go through the kube-apiserver to download the pod's container logs. That requires a token, so I decided to go via the torchx scheduler, which allows instantiating the executor's state.
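
For readers unfamiliar with the log path described above, here is a minimal sketch of what a pod-log request against the kube-apiserver can look like. It uses the standard Kubernetes pod log endpoint (`/api/v1/namespaces/{namespace}/pods/{pod}/log`) with a bearer token. The function name, parameters, and example host are hypothetical and are not taken from this PR; how the executor actually obtains the token and server URL is not shown in the conversation.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_pod_log_request(
    api_server: str,
    namespace: str,
    pod: str,
    token: str,
    container: Optional[str] = None,
    follow: bool = True,
) -> Request:
    """Build (but do not send) a request for a pod's container logs.

    All argument names are illustrative assumptions, not the PR's API.
    """
    query = {"follow": "true" if follow else "false"}
    if container:
        query["container"] = container
    url = (
        f"{api_server.rstrip('/')}/api/v1/namespaces/{namespace}"
        f"/pods/{pod}/log?{urlencode(query)}"
    )
    # The kube-apiserver accepts bearer tokens in the Authorization header.
    return Request(url, headers={"Authorization": f"Bearer {token}"})
```

Passing the returned request to `urllib.request.urlopen` and iterating over the response would stream log lines while `follow=true`; that step is left out here because it needs a reachable cluster.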

I still need to:

  • test this e2e with a DGXC environment
  • add unit tests

ko3n1g (Contributor, Author) commented Nov 6, 2025

hey @roclark, Hemil mentioned that you built the initial DGXC executor. Before diving too deep into the work, I wanted to prefetch your thoughts on this PR. Do you agree with the implementation design to fetch container logs?

roclark (Contributor) commented Nov 7, 2025

Hey @ko3n1g, I haven't worked on the Run:ai pieces in a while, but I think this makes sense. My only question would be if the API server is something that is always exposed/available by default to users on installations or if that needs to be configured by an admin beforehand.

Signed-off-by: oliver könig <[email protected]>
from enum import Enum
from pathlib import Path
from typing import Any, Optional, Type
from typing import Any, Dict, Iterable, Optional
Contributor

Can we remove Dict and just use dict for the types?
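
As an illustration of the suggestion, built-in generics such as `dict[str, Any]` work as annotations on Python 3.9 and later, so `typing.Dict` no longer needs to be imported. The helper below is a hypothetical example and is not code from this PR.

```python
from typing import Any, Iterable, Optional


def merge_labels(
    base: dict[str, Any],
    extra: Optional[Iterable[tuple[str, Any]]] = None,
) -> dict[str, Any]:
    """Return a copy of `base` updated with optional key/value pairs."""
    merged = dict(base)
    if extra is not None:
        merged.update(extra)
    return merged
```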

role_name,
replica_id,
regex,
None,
Contributor

Why are these needed?

Contributor Author

The torchx call takes 8 args, so prior to this change we were passing too few positional arguments, and should_tail and streams ended up in the since and until slots.

Since my code paths use streams, I ran into this issue.
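
The misrouting described above can be reproduced with a small stand-in. The stub below assumes the torchx call in question has the parameter order `app_id, role_name, k, regex, since, until, should_tail, streams`; that ordering is my reading of the comment and the snippet under review, and the stub is hypothetical rather than the real torchx function.

```python
from typing import Any, Optional


def log_iter_stub(
    app_id: str,
    role_name: str,
    k: int = 0,
    regex: Optional[str] = None,
    since: Any = None,
    until: Any = None,
    should_tail: bool = False,
    streams: Any = None,
) -> dict[str, Any]:
    """Hypothetical stand-in that reports where each argument landed."""
    return {
        "since": since,
        "until": until,
        "should_tail": should_tail,
        "streams": streams,
    }


# Before: only six positional args, so the last two fill `since` and `until`.
before = log_iter_stub("app", "role", 0, None, True, "stderr")

# After: explicit None placeholders keep should_tail and streams in place.
after = log_iter_stub("app", "role", 0, None, None, None, True, "stderr")
```

Passing `should_tail` and `streams` as keyword arguments would avoid the placeholders entirely, at the cost of depending on the parameter names.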

Signed-off-by: oliver könig <[email protected]>
ko3n1g merged commit 6b2240e into main on Nov 20, 2025
23 checks passed