Run ABC passes in parallel #5266
WASI doesn't have threading support, so you'll need to add a way to downgrade to not using threads. There is already a |
If you want to go even further: ABC always converts liberty libraries to its internal genlib format; it does not work with full liberty data. So you could factor this part out as well, so that liberty files are only processed once. This provides very significant time savings if you have a larger (commercial) library. |
I will test this on a few larger designs to see what it does to memory usage, but I actually think this should be fine. In my experience with more 'trivial' multithreading of ABC using xargs, it doesn't noticeably increase peak memory usage. Another thing to consider: it might be worth sorting the extracted netlists by size and always starting with the largest ones, as they will likely take the longest in ABC, which could otherwise limit the time spent in multithreaded mode if they are queued too late. |
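A minimal sketch of the largest-first ordering suggested above. The `AbcJob` struct and `sort_largest_first()` are hypothetical names, not part of Yosys, and the sketch assumes gate count is a usable proxy for how long ABC will take on a netlist.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one extracted netlist waiting for an ABC run.
struct AbcJob {
    std::string name;
    std::size_t gate_count;  // assumption: a rough proxy for ABC runtime
};

// Reorder jobs so the largest ones are dispatched first. stable_sort keeps
// the original relative order of equally sized jobs, so runs stay reproducible.
inline void sort_largest_first(std::vector<AbcJob> &jobs) {
    std::stable_sort(jobs.begin(), jobs.end(),
                     [](const AbcJob &a, const AbcJob &b) {
                         return a.gate_count > b.gate_count;
                     });
}
```

Dispatching in this order is the usual "longest job first" heuristic: it does not shorten any single run, it only makes it less likely that one long job starts last while the other workers sit idle.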
That's a great idea. Could and probably should be done as a separate PR after this lands. |
WASI does: you need to build for the There is a good reason to keep threading support optional: it requires more hostcalls from the runtime, and at least in the browser, it requires |
705cf74 to 7367ef3
There is a performance issue I need to investigate so it's not ready for review right now. I'm not sure how to set |
I think you need to modify the Line 33 in da01e17
I'm not sure exactly how you tell Visual Studio to use the pthread lib, but there is a field in there for preprocessor definitions:
It's probably fine to leave it as-is though. |
7367ef3 to 52556cf
fff515e to d3d557d
…odules to an `AbcConfig` struct.
…ion `extract()` Splits up the big `abc_module()` function and isolates the code that modifies the design after running ABC.
Currently `assign_map` is rebuilt from the module from scratch every time we invoke ABC. That doesn't scale when we do thousands of ABC runs over large modules. Instead, create it once and then maintain it incrementally as we update the module.
…wires in the module every time we run ABC. This does not scale when we run ABC thousands of times in a single AbcPass.
`prepare_module()` will have to run on the main thread.
Large circuits can run hundreds or thousands of ABCs in a single AbcPass. For some circuits, some of those ABC runs can run for hundreds of seconds. Running ABCs in parallel with each other and in parallel with main-thread processing (reading and writing BLIF files, copying ABC BLIF output into the design) can give large speedups.
d3d557d to 4fd01fa
Doing ABC runs in parallel can actually make things slower when every ABC run requires spawning an ABC subprocess, especially when using popen(), which on glibc does not use vfork(). What seems to happen is that constant fork()ing keeps making the main process data pages copy-on-write, so the main process code that is setting up each ABC call takes a lot of minor page-faults, slowing it down.
The solution is pretty straightforward although a little tricky to implement: we reuse ABC subprocesses. Instead of passing the ABC script name on the command line, we spawn an ABC REPL and pipe a command into it to source the script. When that's done we echo an `ABC_DONE` token instead of exiting. Yosys then puts the ABC process onto a stack which we can pull from the next time we do an ABC run.
For one of our large designs, this is an additional 5x speedup of the primary AbcPass. It does 5155 ABC runs, all very small; runtime of the AbcPass goes from 760s to 149s (not very scientific benchmarking but the effect size is large).
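The process-reuse scheme described in this commit message can be sketched roughly as follows. This is not the code from the PR: `ReplProcess`, `spawn_repl()`, `run_command()`, `close_repl()` and `ReplPool` are hypothetical names, the sketch is POSIX-only, and it assumes the child accepts an `echo` command on stdin (true for `/bin/sh`, which is used as a stand-in below; the exact commands sent to ABC are not shown here).

```cpp
#include <cstdio>
#include <mutex>
#include <string>
#include <utility>
#include <vector>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical handle for one long-lived child process driven over pipes.
struct ReplProcess {
    pid_t pid = -1;
    FILE *to_child = nullptr;    // connected to the child's stdin
    FILE *from_child = nullptr;  // connected to the child's stdout
};

// Spawn the command once. Returns false on failure.
inline bool spawn_repl(const std::vector<std::string> &argv, ReplProcess &out) {
    if (argv.empty()) return false;
    // Build the exec argument array before fork() so the child never allocates.
    std::vector<char *> args;
    for (const std::string &a : argv) args.push_back(const_cast<char *>(a.c_str()));
    args.push_back(nullptr);
    int in_pipe[2], out_pipe[2];
    if (pipe(in_pipe) != 0) return false;
    if (pipe(out_pipe) != 0) { close(in_pipe[0]); close(in_pipe[1]); return false; }
    pid_t pid = fork();
    if (pid < 0) {
        close(in_pipe[0]); close(in_pipe[1]); close(out_pipe[0]); close(out_pipe[1]);
        return false;
    }
    if (pid == 0) {  // child: wire the pipes to stdin/stdout, then exec
        dup2(in_pipe[0], STDIN_FILENO);
        dup2(out_pipe[1], STDOUT_FILENO);
        close(in_pipe[0]); close(in_pipe[1]); close(out_pipe[0]); close(out_pipe[1]);
        execvp(args[0], args.data());
        _exit(127);
    }
    close(in_pipe[0]);
    close(out_pipe[1]);
    // Keep the parent's pipe ends from leaking into children spawned later.
    fcntl(in_pipe[1], F_SETFD, FD_CLOEXEC);
    fcntl(out_pipe[0], F_SETFD, FD_CLOEXEC);
    out.pid = pid;
    out.to_child = fdopen(in_pipe[1], "w");
    out.from_child = fdopen(out_pipe[0], "r");
    return out.to_child != nullptr && out.from_child != nullptr;
}

// Send one command, then collect output lines until the done token shows up.
inline bool run_command(ReplProcess &p, const std::string &command, std::string &output) {
    static const std::string done = "ABC_DONE";
    std::fprintf(p.to_child, "%s\necho %s\n", command.c_str(), done.c_str());
    std::fflush(p.to_child);
    char buf[4096];
    while (std::fgets(buf, sizeof buf, p.from_child)) {
        std::string line(buf);
        if (line.find(done) != std::string::npos) return true;
        output += line;
    }
    return false;  // the child exited before printing the token
}

// Close the pipes (EOF on stdin ends the REPL) and reap the child.
inline void close_repl(ReplProcess &p) {
    if (p.to_child) std::fclose(p.to_child);
    if (p.from_child) std::fclose(p.from_child);
    if (p.pid > 0) waitpid(p.pid, nullptr, 0);
    p = ReplProcess();
}

// Stack of idle processes: reuse one if available, otherwise spawn a new one.
class ReplPool {
public:
    explicit ReplPool(std::vector<std::string> argv) : argv_(std::move(argv)) {}
    bool acquire(ReplProcess &p) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!idle_.empty()) { p = idle_.back(); idle_.pop_back(); return true; }
        }
        return spawn_repl(argv_, p);
    }
    void release(const ReplProcess &p) {
        std::lock_guard<std::mutex> lock(mutex_);
        idle_.push_back(p);
    }
private:
    std::vector<std::string> argv_;
    std::mutex mutex_;
    std::vector<ReplProcess> idle_;
};
```

With `/bin/sh` as the stand-in REPL, a pool built from `{"/bin/sh"}` spawns one shell on the first `acquire()`; after `release()`, the next `acquire()` hands back the same process, so the second command runs without another fork(). The sketch leaves out error handling a real implementation would need, such as coping with a child that dies mid-command (writing to it would raise SIGPIPE).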
4fd01fa to f71e9e3
I've updated the PR. Mainly I've added another commit that uses a pool of ABC processes and reuses ABC processes instead of always spawning a new one for every ABC run. This avoids some situations where doing parallel ABC runs could actually be a regression. |
Now that there is a threadpool I think it makes sense to have a scratchpad (Yosys' internal config system) value to set a maximum number of threads. |
Do you want it in this PR or a separate PR? |
What are the reasons/motivation for this change?
For large circuits, `AbcPass` can run ABC hundreds or thousands of times, once per unique `clkdomain_t`. Some of those ABC runs take a while. Running those ABCs in parallel is possible, because the cells assigned to each ABC run are disjoint. This PR improves the runtime of `AbcPass` on one of our large circuits by 5x (which translates into a 3x speedup of synthesis end-to-end).
Explain how this is achieved.
This builds on PR #5239.
Reading and writing RTLIL are not thread-safe for various reasons, and fixing that would be difficult. So for now we stick with reading and writing RTLIL on the main thread. We split up the per-clkdomain work into a "prepare" phase which builds the gate netlist and removes the corresponding cells from the module, a "run ABC" phase which actually runs ABC, and an "extract" phase which processes the results of an ABC run to create fresh cells in the module. Everything happens on the main thread other than the "run ABC" phases. For parallelism, we create a set of worker threads which pull work from a concurrent queue fed by the main thread.
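The prepare / run ABC / extract split described above can be sketched roughly as follows. This is an illustration only, not the PR's code: `ConcurrentQueue`, `WorkItem`, `run_abc_stub()` and `run_pipeline()` are hypothetical names, and the "run ABC" phase is replaced by a trivial string transformation so the example stays self-contained.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Minimal blocking queue: producers push, consumers pop until closed and drained.
template <typename T>
class ConcurrentQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lock(mutex_); items_.push_back(std::move(item)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lock(mutex_); closed_ = true; }
        cv_.notify_all();
    }
    // Blocks until an item is available. Returns false once closed and empty.
    bool pop(T &out) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<T> items_;
    bool closed_ = false;
};

// Hypothetical work item: one per clock domain.
struct WorkItem {
    int id;
    std::string netlist;  // produced by "prepare" on the main thread
    std::string result;   // produced by "run ABC" on a worker thread
};

// Stand-in for the "run ABC" phase; the real phase would invoke ABC.
inline std::string run_abc_stub(const std::string &netlist) {
    return "mapped(" + netlist + ")";
}

inline std::vector<WorkItem> run_pipeline(const std::vector<std::string> &domains,
                                          unsigned num_workers) {
    num_workers = std::max(1u, num_workers);  // at least one worker, or we would hang
    ConcurrentQueue<WorkItem> pending, finished;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < num_workers; ++i)
        workers.emplace_back([&pending, &finished] {
            WorkItem item;
            while (pending.pop(item)) {  // "run ABC": the only phase off the main thread
                item.result = run_abc_stub(item.netlist);
                finished.push(std::move(item));
            }
        });
    // "prepare": main thread builds each netlist and feeds the queue.
    int id = 0;
    for (const std::string &d : domains) pending.push(WorkItem{id++, d, ""});
    pending.close();
    // "extract": main thread consumes results in completion order, stores them by id.
    std::vector<WorkItem> extracted(domains.size());
    for (std::size_t n = 0; n < domains.size(); ++n) {
        WorkItem item;
        finished.pop(item);
        extracted[static_cast<std::size_t>(item.id)] = std::move(item);
    }
    for (std::thread &t : workers) t.join();
    return extracted;
}
```

In this simplified version the main thread queues every item before it starts extracting. Interleaving "prepare" and "extract" on the main thread while workers are busy would be a refinement of the same structure.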
To simplify things and also provide a small performance boost, the writing of `lutdefs.txt` and `stdcells.genlib` is factored out so it happens just once per pass instead of once per ABC run.
Writing thread-safe code in C++ is very scary, especially in a large existing project like Yosys not designed for multithreading. Fortunately the "run ABC" phase is not large and mostly self-contained so the risk of this PR may be acceptable.
One thread safety problem I had to tackle was logging. `Yosys::log()` is not thread-safe and making it thread-safe would be very invasive. The obvious approach of putting locks around everything would slow down the single-threaded case and not scale well for parallel threads, plus the desired behavior of some of the logging functions w.r.t. concurrent logging calls is not clear. So I've created a `DeferredLogs` class which exposes a `log()` function which simply captures the logs for a particular work item into a buffer. Eventually, back on the main thread, those deferred logs are printed via the standard `log()` function. If log timing is enabled then the timestamps are not meaningful; we can fix that by extending the logging API so callers can pass in previously captured timestamps, but I'd prefer to do that after my logging PR #5243 has been merged.
If applicable, please suggest to reviewers how they can test the change.
The existing Yosys test suite exercises this code fairly well. If we take this PR and especially if we carry on down the road of adding more parallelism, it would be good to run the Yosys test suite with TSAN regularly.
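For the deferred logging approach described earlier, a minimal sketch of the idea might look like the following. It borrows the `DeferredLogs` name from the description, but the members shown here (`log()`, `flush()`, the `Sink` callback) are assumptions for illustration and not the PR's actual interface. In Yosys the sink would presumably forward each entry to the standard `log()` on the main thread.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Captures log output for one work item; each instance is used by a single
// thread at a time, so no locking is needed.
class DeferredLogs {
public:
    // printf-style capture into an in-memory buffer instead of the real log.
    void log(const char *format, ...) {
        va_list args;
        va_start(args, format);
        va_list copy;
        va_copy(copy, args);
        int len = std::vsnprintf(nullptr, 0, format, copy);  // measure first
        va_end(copy);
        if (len >= 0) {
            std::string text(static_cast<std::size_t>(len) + 1, '\0');
            std::vsnprintf(&text[0], text.size(), format, args);
            text.resize(static_cast<std::size_t>(len));  // drop the trailing NUL
            entries_.push_back(std::move(text));
        }
        va_end(args);
    }

    // Replay the captured entries in order through a caller-supplied sink,
    // meant to be called on the main thread, then clear the buffer.
    template <typename Sink>
    void flush(Sink sink) {
        for (const std::string &entry : entries_) sink(entry);
        entries_.clear();
    }

private:
    std::vector<std::string> entries_;
};
```

Because each work item owns its own buffer, worker threads never touch shared logging state, and the output of one ABC run stays contiguous when it is finally printed.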