Delay entering TrialRunner context until run_trial #970

bpkroth · 2025-04-28T21:48:03Z

Pull Request

Title

Delay entering TrialRunner context until run_trial.

Description

This is part of an attempt to see whether we can work around issues with multiprocessing.Pool needing to pickle certain objects when forking.

For instance, if the Environment is using an SshService, we need to start an EventLoopContext in the background to handle the SSH connections, and threads are not picklable.

Nor are file handles, DB connections, etc., so there may be other things we also need to adjust to make this work.
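To make the pickling constraint concrete, here is a minimal sketch using only the standard library. The dictionaries and the `try_pickle` helper are hypothetical illustrations, not mlos_bench code; they only show that plain data pickles while state holding live threading primitives raises an error.

```python
import pickle
import threading


def try_pickle(obj) -> str:
    """Return "ok" if obj pickles, else the name of the exception raised."""
    try:
        pickle.dumps(obj)
        return "ok"
    except (TypeError, pickle.PicklingError) as exc:
        return type(exc).__name__


# Plain configuration data pickles without trouble.
plain_state = {"host": "example-host", "port": 22}

# The same data plus a lock and a running thread does not.
stop_event = threading.Event()
worker = threading.Thread(target=stop_event.wait, daemon=True)
worker.start()
live_state = {**plain_state, "lock": threading.Lock(), "thread": worker}

plain_result = try_pickle(plain_state)
live_result = try_pickle(live_state)

# Shut the background thread down cleanly.
stop_event.set()
worker.join()

print(plain_result, live_result)
```

Since `multiprocessing.Pool` pickles task arguments before sending them to its workers, any object carrying state like `live_state` cannot be handed to a pool as-is.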

Type of Change

🛠️ Bug fix
🔄 Refactor

Testing

Light so far (still in draft mode)
Just basic existing CI tests (seems to not break anything)

Additional Notes (optional)

I think this is incomplete. To support forking inside the Scheduler and then entering the context of the given TrialRunner, we may also need to do something about the Scheduler's Storage object.

That was true; those PRs are forthcoming. See also #971.

For now this is a draft PR to allow @jsfreischuetz and me to play with alternative organizations of #967.

Copilot

Pull Request Overview

This PR delays entering the TrialRunner context until running a trial to better accommodate issues with multiprocessing and object pickling. Key changes include:

Wrapping the trial_runner execution in a context within run_trial.
Delaying the entry into TrialRunner contexts in the scheduler enter method.
Adjusting the teardown process to handle trial_runner context appropriately.
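The three changes listed above can be sketched roughly as follows. `DemoRunner` and `DemoScheduler` are hypothetical stand-ins, not the actual mlos_bench classes, and entering the runner context during teardown is an assumption about how the adjustment might look rather than a description of the merged code.

```python
class DemoRunner:
    """Hypothetical stand-in for a TrialRunner; records its lifecycle."""

    def __init__(self, runner_id: int) -> None:
        self.runner_id = runner_id
        self.in_context = False
        self.log = []

    def __enter__(self) -> "DemoRunner":
        self.in_context = True
        self.log.append("enter")
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.in_context = False
        self.log.append("exit")
        return False

    def run_trial(self, trial_id: int) -> None:
        assert self.in_context, "run_trial requires an entered context"
        self.log.append(f"run_trial {trial_id}")

    def teardown(self) -> None:
        assert self.in_context, "teardown requires an entered context"
        self.log.append("teardown")


class DemoScheduler:
    """Hypothetical scheduler that defers entering its runners."""

    def __init__(self, runners) -> None:
        self.runners = runners

    def __enter__(self) -> "DemoScheduler":
        # The runners are deliberately NOT entered here anymore.
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        return False

    def run_trial(self, trial_id: int) -> None:
        runner = self.runners[trial_id % len(self.runners)]
        # Enter the runner context only for the duration of the trial.
        with runner:
            runner.run_trial(trial_id)

    def teardown(self) -> None:
        # Assumption: teardown is a separate step with its own context entry.
        for runner in self.runners:
            with runner:
                runner.teardown()


runners = [DemoRunner(0), DemoRunner(1)]
with DemoScheduler(runners) as scheduler:
    entered_early = any(r.in_context for r in runners)
    scheduler.run_trial(0)
    scheduler.run_trial(1)
    scheduler.teardown()
```

In this sketch each runner is entered once per trial and once more for teardown, instead of being held open for the scheduler's whole lifetime.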

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
mlos_bench/mlos_bench/schedulers/sync_scheduler.py	Wraps trial_runner.run_trial within a context
mlos_bench/mlos_bench/schedulers/base_scheduler.py	Delays entering trial_runner contexts and adjusts exit and teardown accordingly

motus

Looks nice and straightforward, although I have questions about the TrialRunner context. Do we need one for each instance of the TrialRunner? What's the life cycle? Let's merge this PR and discuss it

bpkroth · 2025-05-12T21:01:12Z

> Looks nice and straightforward, although I have questions about the TrialRunner context. Do we need one for each instance of the TrialRunner? What's the life cycle? Let's merge this PR and discuss it

The idea is roughly to create one TrialRunner per parallel trial we want to support, or else one per worker from which we want to sample noise in the target system (e.g., in the case of Tuna).

The only real reason for delaying entering the context is to allow pickling the TrialRunner when passing it to a mp.Pool worker process (nuances of how Python does multiprocessing pools).

An Environment could use something that requires state such as a background thread (e.g., if it uses SshService it needs an EventLoopContext, which we'll eventually want for background async status polling too). Since those threads are not picklable, we don't want to enter the Environment context until we're ready to run it.
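A rough sketch of why the delay helps, under stated assumptions: `DemoRunner` below is a hypothetical stand-in for a TrialRunner whose context starts a background thread, and a `pickle` round trip is used to approximate what `mp.Pool` does when it ships a task argument to a worker process. It is not the mlos_bench implementation.

```python
import pickle
import threading


class DemoRunner:
    """Hypothetical stand-in for a TrialRunner (not the mlos_bench class)."""

    def __init__(self, runner_id: int) -> None:
        self.runner_id = runner_id
        self._stop = None
        self._thread = None

    def __enter__(self) -> "DemoRunner":
        # The thread stands in for something like an EventLoopContext.
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._stop.wait, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self._stop.set()
        self._thread.join()
        self._stop = None
        self._thread = None
        return False

    def run_trial(self, trial_id: int) -> str:
        assert self._thread is not None, "context must be entered first"
        return f"runner {self.runner_id} ran trial {trial_id}"


def can_pickle(obj) -> bool:
    """Report whether pickle.dumps accepts the object."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


runner = DemoRunner(runner_id=0)
before = can_pickle(runner)  # no thread exists yet

# Round-trip through pickle, approximating the hand-off to a pool worker.
worker_copy = pickle.loads(pickle.dumps(runner))
with worker_copy:
    inside = can_pickle(worker_copy)  # a live thread is now attached
    result = worker_copy.run_trial(trial_id=1)
after = can_pickle(worker_copy)  # thread state was dropped on exit

print(before, inside, after, result)
```

The runner can cross the process boundary only while its context is not entered, which is why the context entry moves into the code that runs on the worker side.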

From what I can see, the only slight overhead is that, as we run multiple trials, we enter and exit the context somewhat more frequently and so create and destroy those background threads more often.

But that happens on the order of minutes, not even seconds generally, so it's not a huge issue I think.

Aside from that, teardown is executed as a separate step from leaving the context anyway, so I don't think there's a huge risk there.

But happy to discuss more if you noticed something I didn't.

bpkroth mentioned this pull request May 2, 2025

WIP: Parallel Trial Scheduler #971

Draft

10 tasks

bpkroth marked this pull request as ready for review May 9, 2025 17:54

Copilot AI review requested due to automatic review settings May 9, 2025 17:54

bpkroth requested a review from a team as a code owner May 9, 2025 17:54

bpkroth added the ready for review Ready for review label May 9, 2025

Copilot AI reviewed May 9, 2025

bpkroth enabled auto-merge (squash) May 9, 2025 18:41

bpkroth mentioned this pull request May 9, 2025

WIP: Prepare Experiment.load to handle async out of order trial completion #973

Draft

7 tasks

motus approved these changes May 12, 2025

bpkroth merged commit 19028d8 into microsoft:main May 12, 2025
16 checks passed

bpkroth deleted the delay-enter-trial-runner-context branch May 12, 2025 20:42
