Skip to content

Conversation

bpkroth
Copy link
Contributor

@bpkroth bpkroth commented Apr 28, 2025

Pull Request

Title

Delay entering TrialRunner context until run_trial.


Description

This is part of an attempt to try and see if can work around issues with multiprocessing.Pool needing to pickle certain objects when forking.

For instance, if the Environment is using an SshServer, we need to start an EventLoopContext in the background to handle the SSH connections and threads are not picklable.

Nor are file handles, DB connections, etc., so there may be other things we also need to adjust to make this work.

See Also #967


Type of Change

  • 🛠️ Bug fix
  • 🔄 Refactor

Testing

  • Light so far (still in draft mode)
  • Just basic existing CI tests (seems to not break anything)

Additional Notes (optional)

I think this is incomplete. To support forking inside the Scheduler and then entering the context of the given TrialRunner, we may also need to do something about the Scheduler's Storage object.

That was true, those PRs are now forthcoming. See Also #971

For now this is a draft PR to allow @jsfreischuetz and I to play with alternative organizations of #967.


This is part of an attempt to try and see if can work around issues with
`multiprocessing.Pool` needing to pickle certain objects when forking.

For instance, if the Environment is using an SshServer, we need to start an
EventLoopContext in the background to handle the SSH connections and threads are
not picklable.

Nor are file handles, DB connections, etc., so there may be other things we also
need to adjust to make this work.

See Also microsoft#967
@bpkroth bpkroth mentioned this pull request May 2, 2025
10 tasks
@bpkroth bpkroth marked this pull request as ready for review May 9, 2025 17:54
@Copilot Copilot AI review requested due to automatic review settings May 9, 2025 17:54
@bpkroth bpkroth requested a review from a team as a code owner May 9, 2025 17:54
@bpkroth bpkroth added the ready for review Ready for review label May 9, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR delays entering the TrialRunner context until running a trial to better accommodate issues with multiprocessing and object pickling. Key changes include:

  • Wrapping the trial_runner execution in a context within run_trial.
  • Delaying the entry into TrialRunner contexts in the scheduler enter method.
  • Adjusting the teardown process to handle trial_runner context appropriately.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
mlos_bench/mlos_bench/schedulers/sync_scheduler.py Wraps trial_runner.run_trial within a context
mlos_bench/mlos_bench/schedulers/base_scheduler.py Delays entering trial_runner contexts and adjusts exit and teardown accordingly

Copy link
Member

@motus motus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice and straightforward, although I have questions about the TrialRunner context. Do we need one for each instance of the TrialRunner? What's the life cycle? Let's merge this PR and discuss it

@bpkroth bpkroth merged commit 19028d8 into microsoft:main May 12, 2025
16 checks passed
@bpkroth bpkroth deleted the delay-enter-trial-runner-context branch May 12, 2025 20:42
@bpkroth
Copy link
Contributor Author

bpkroth commented May 12, 2025

Looks nice and straightforward, although I have questions about the TrialRunner context. Do we need one for each instance of the TrialRunner? What's the life cycle? Let's merge this PR and discuss it

The idea is roughly to create a TrialRunner for each number of parallel trials we want to support, or else for the number of workers we want to sample noise in the target system from (e.g., in the case of Tuna).

The only real reason for delaying entering the context is to allow pickling the TrialRunner when passing it to a mp.Pool worker process (nuances of how Python does multiprocessing pools).

Since an Environment could use something that requires state like a background tread (e.g., if it uses SshService so needs an EventLoopContext, which we'll eventually want for background async status polling too), then we don't want to enter the Environment context until we're ready to run it since those threads are not picklable.

From my looking around, the only thing I can think of that might be a slight overhead is that as we run multiple trials we enter and exit the context somewhat more frequently and so create and destroy those background threads.

But that happens on the order of minutes, not even seconds generally, so it's not a huge issue I think.

Aside from that teardown is executed as a separate step than leaving context anyways, so I don't think there's a huge risk there.

But happy to discuss more if you noticed something I didn't.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready for review Ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants