Conversation

@wanyunSu
Contributor

@wanyunSu wanyunSu commented Nov 27, 2025

Description

  1. Restore log output so that it is visible via the logs command and on the k8s dashboard.
  2. Fix [Bug]: K8s: root-controller can't communicate from non-localhost #701 by binding the root-controller's gRPC server to the pod IP.
    To test it: change the config to use an actual hostname other than 'localhost'.
  3. Fix [Bug]: k8s: cannot restart session in the same unified shell #668 by refactoring the namespace creation function to avoid a k8s race condition.
    The order is now: Terminating → wait until 404 → create namespace → wait Active → create pods
  4. Fix [Bug]: K8s: headless service port is hardcoded #715; the port number is now taken from the pm config.
  5. Inject three env vars into all pods: $DOTDRUNC, $USER and $HOME, via the function _build_container_env.
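
For readers following item 3, the ordering "Terminating → wait until 404 → create namespace → wait Active" can be sketched roughly as below. This is not the code from the PR: get_phase and create are hypothetical callables standing in for the cluster calls (get_phase returns None when the namespace lookup gives a 404), and raising on an already existing, non-terminating namespace is an assumption, since the description does not say how that case is handled.

```python
import time
from typing import Callable, Optional


def wait_until(predicate: Callable[[], bool], timeout_s: float, interval_s: float = 0.5) -> bool:
    """Poll predicate until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def prepare_namespace(
    get_phase: Callable[[], Optional[str]],  # hypothetical: None means "404, not found"
    create: Callable[[], None],              # hypothetical: issues the create request
    delete_timeout_s: float,
    active_timeout_s: float,
    interval_s: float = 0.5,
) -> None:
    """Wait out a terminating namespace, create a fresh one, wait until it is Active."""
    # Step 1: if the old namespace is still terminating, wait until it is gone.
    if get_phase() == "Terminating":
        if not wait_until(lambda: get_phase() is None, delete_timeout_s, interval_s):
            raise TimeoutError("namespace is still terminating")

    # Assumption: an existing namespace that is not terminating is treated as an error.
    if get_phase() is not None:
        raise RuntimeError("namespace already exists")

    # Step 2: create the namespace, then wait until it reports Active.
    create()
    if not wait_until(lambda: get_phase() == "Active", active_timeout_s, interval_s):
        raise TimeoutError("namespace did not become Active")
```

Pods would only be created after prepare_namespace returns, which is what removes the race between deletion and re-creation.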

Type of change

  • Documentation (non-breaking change that adds or improves the documentation)
  • New feature (non-breaking change which adds functionality)
  • Optimization (non-breaking, back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (whatever its nature)

Key checklist

  • All tests pass (eg. python -m pytest)
  • Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added or an issue has been opened to tackle that in the future.
    (Indicate issue here: # (issue))

@wanyunSu wanyunSu self-assigned this Nov 27, 2025
@PawelPlesniak PawelPlesniak self-requested a review November 27, 2025 16:20
@PawelPlesniak
Collaborator

@wanyunSu please fill in the metadata

@MRiganSUSX MRiganSUSX self-requested a review December 3, 2025 14:57
Contributor

@MRiganSUSX MRiganSUSX left a comment

Hi @wanyunSu ,
Thank you for making these improvements. I have comments, ordered by priority:

  • Since you changed the command for log redirects to use process substitution, which is a Bash feature, we must change the container command interpreter to bash.
    command=["/bin/bash", "-c"],
  • For the pod_ip injection, the regex is too aggressive. It works for this case, but in the future we may pass several different parameters to the cmd, and this would replace all of them. It should target only the URL it is trying to change (i.e. localhost, 127. IPs, etc.). Something like this would be more appropriate:
modified_arg = re.sub(
    r"([a-zA-Z]+://)(localhost|0\.0\.0\.0|127\.0\.0\.1)(:\d+)", 
    r"\g<1>${POD_IP}\g<3>", 
    arg
)
  • the namespace deletion timeout should use its own (separate) timeout var
  • regarding the 'service port': this is too generic. There are many services around, so the naming should make it immediately identifiable, such as headless_discovery_port or similar. This applies both to the PM and to the json config.
  • you nicely added the usage of labels, but with your change it is now being calculated and applied three times! both in _create_pod and in _build_pod_main_container. This is also clear from the logs:
Screenshot from 2025-12-03 16-48-57

this should be calculated / applied once, and then passed on as needed.

  • minor: the function _ensure_namespace_exists does not actually 'ensure' that the namespace exists. It simply checks whether it is possible to create a namespace (i.e. there is currently no namespace that is not terminating). I suggest renaming it to something that better describes what the function does, e.g. _prepare_new_session_namespace or similar.
  • function _build_pod_main_container is a lot more complex now. I suggest factoring out the new env-setting functionality, something like:
def _build_container_env(self, boot_request, tree_labels) -> list[client.V1EnvVar]:
  • in addition to the $USER var, do we also need to pass $HOME? Currently it is empty inside the pod (defaulted to "/").
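
To illustrate the narrower regex suggested above, here is a small self-contained check of what it rewrites and what it leaves alone. The wrapper name inject_pod_ip and the sample arguments (including the hostname np04-srv-001) are made up for this example; only the pattern and the replacement string come from the review comment.

```python
import re

# Pattern from the review comment: scheme, a loopback/any-address host, then a port.
LOOPBACK_URL = re.compile(r"([a-zA-Z]+://)(localhost|0\.0\.0\.0|127\.0\.0\.1)(:\d+)")


def inject_pod_ip(arg: str) -> str:
    """Replace only the host part of loopback URLs with the literal text ${POD_IP}."""
    # "${POD_IP}" is emitted literally; the shell inside the container expands it later.
    return LOOPBACK_URL.sub(r"\g<1>${POD_IP}\g<3>", arg)


samples = [
    "grpc://localhost:3333",           # rewritten
    "http://127.0.0.1:8080/status",    # rewritten, path preserved
    "grpc://np04-srv-001:3333",        # real hostname (hypothetical): untouched
    "--log-path=/tmp/localhost.log",   # no scheme or port: untouched
    "grpc://localhost",                # no port: untouched
]
for sample in samples:
    print(f"{sample!r} -> {inject_pod_ip(sample)!r}")
```

Note that the pattern as written requires an explicit port and matches 127.0.0.1 only, not the rest of the 127. range.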

@wanyunSu
Contributor Author

wanyunSu commented Dec 4, 2025

Thanks @MRiganSUSX for the comments! Please find below the resolution for each of them:

  1. Changed the container command to bash.
  2. For the pod_ip injection, the regex now uses the grpc protocol (localhost and 127. addresses are handled here).
  3. The namespace deletion timeout uses the restart timeout, as it is part of the restart procedure.
  4. 'service port' is changed to 'service headless_discovery_port'.
  5. Cleaned up the usage of labels.
  6. The function ensure_namespace_exists is now prepare_namespace.
  7. Added the function _build_container_env, used by _build_pod_main_container.
  8. Injected $HOME into the pod env; the path is built from the pm config home_path_base + $USER.
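
As a rough illustration of items 7 and 8, an env builder along these lines would produce the three variables. This is only a sketch: the function signature, the plain dicts (standing in for client.V1EnvVar objects so the example has no dependencies) and the use of a path join for "home_path_base + $USER" are all assumptions, not the PR's actual code.

```python
import posixpath


def build_container_env(user: str, home_path_base: str, dotdrunc: str) -> list[dict[str, str]]:
    """Return the env entries injected into every pod (sketch, not the PR code)."""
    # Assumption: $HOME is home_path_base joined with the user name.
    home = posixpath.join(home_path_base, user)
    return [
        {"name": "DOTDRUNC", "value": dotdrunc},
        {"name": "USER", "value": user},
        {"name": "HOME", "value": home},
    ]


# Example with made-up values:
for entry in build_container_env("alice", "/home", "/home/alice/.drunc.json"):
    print(f"{entry['name']}={entry['value']}")
```

With the Kubernetes Python client, each dict would correspond to a client.V1EnvVar(name=..., value=...) entry on the container.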

@wanyunSu wanyunSu requested a review from MRiganSUSX December 4, 2025 11:27
@wanyunSu wanyunSu mentioned this pull request Dec 11, 2025
20 tasks

4 participants