-
Notifications
You must be signed in to change notification settings - Fork 73
Delay entering TrialRunner context until run_trial #970
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delay entering TrialRunner context until run_trial #970
Conversation
This is part of an attempt to try and see if can work around issues with `multiprocessing.Pool` needing to pickle certain objects when forking. For instance, if the Environment is using an SshServer, we need to start an EventLoopContext in the background to handle the SSH connections and threads are not picklable. Nor are file handles, DB connections, etc., so there may be other things we also need to adjust to make this work. See Also microsoft#967
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR delays entering the TrialRunner context until running a trial to better accommodate issues with multiprocessing and object pickling. Key changes include:
- Wrapping the trial_runner execution in a context within run_trial.
- Delaying the entry into TrialRunner contexts in the scheduler enter method.
- Adjusting the teardown process to handle trial_runner context appropriately.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
mlos_bench/mlos_bench/schedulers/sync_scheduler.py | Wraps trial_runner.run_trial within a context |
mlos_bench/mlos_bench/schedulers/base_scheduler.py | Delays entering trial_runner contexts and adjusts exit and teardown accordingly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice and straightforward, although I have questions about the TrialRunner
context. Do we need one for each instance of the TrialRunner
? What's the life cycle? Let's merge this PR and discuss it
The idea is roughly to create a The only real reason for delaying entering the context is to allow pickling the TrialRunner when passing it to a mp.Pool worker process (nuances of how Python does multiprocessing pools). Since an Environment could use something that requires state like a background tread (e.g., if it uses SshService so needs an EventLoopContext, which we'll eventually want for background async status polling too), then we don't want to enter the Environment context until we're ready to run it since those threads are not picklable. From my looking around, the only thing I can think of that might be a slight overhead is that as we run multiple trials we enter and exit the context somewhat more frequently and so create and destroy those background threads. But that happens on the order of minutes, not even seconds generally, so it's not a huge issue I think. Aside from that But happy to discuss more if you noticed something I didn't. |
Pull Request
Title
Delay entering
TrialRunner
context untilrun_trial
.Description
This is part of an attempt to try and see if can work around issues with
multiprocessing.Pool
needing to pickle certain objects when forking.For instance, if the Environment is using an SshServer, we need to start an EventLoopContext in the background to handle the SSH connections and threads are not picklable.
Nor are file handles, DB connections, etc., so there may be other things we also need to adjust to make this work.
See Also #967
Type of Change
Testing
Additional Notes (optional)
I think this is incomplete. To support forking inside the Scheduler and then entering the context of the given TrialRunner, we may also need to do something about the Scheduler's Storage object.
That was true, those PRs are now forthcoming. See Also #971
For now this is a draft PR to allow @jsfreischuetz and I to play with alternative organizations of #967.