rocallahan
Contributor

What are the reasons/motivation for this change?

For large circuits AbcPass can run ABC hundreds or thousands of times, once per unique clkdomain_t. Some of those ABC runs take a while. Running those ABCs in parallel is possible, because the cells assigned to each ABC run are disjoint. This PR improves the runtime of AbcPass on one of our large circuits by 5x (which translates into a 3x speedup of synthesis end-to-end).

Explain how this is achieved.

This builds on PR #5239.

Reading and writing RTLIL are not thread-safe for various reasons, and fixing that would be difficult. So for now we stick with reading and writing RTLIL on the main thread. We split up the per-clkdomain work into a "prepare" phase which builds the gate netlist and removes the corresponding cells from the module, a "run ABC" phase which actually runs ABC, and an "extract" phase which processes the results of an ABC run to create fresh cells in the module. Everything happens on the main thread other than the "run ABC" phases. For parallelism, we create a set of worker threads which pull work from a concurrent queue fed by the main thread.
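The structure described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ConcurrentQueue`, `WorkItem` and `run_all` are hypothetical names, and doubling an integer stands in for an actual ABC run.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical concurrent queue: the main thread pushes, workers pop.
template <typename T>
class ConcurrentQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(item));
        }
        ready_.notify_one();
    }
    // Tells the workers that no more items will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until an item is available; returns nullopt once the queue
    // is closed and drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty())
            return std::nullopt;
        T item = std::move(items_.front());
        items_.pop_front();
        return std::optional<T>(std::move(item));
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<T> items_;
    bool closed_ = false;
};

// Stand-in for the state of one per-clock-domain job.
struct WorkItem {
    int input = 0;   // written by the "prepare" phase on the main thread
    int output = 0;  // written by the "run" phase on a worker thread
};

std::vector<int> run_all(const std::vector<int> &inputs, unsigned num_workers) {
    assert(num_workers > 0); // this sketch has no single-threaded fallback
    std::vector<WorkItem> items(inputs.size());
    ConcurrentQueue<WorkItem *> queue;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < num_workers; ++i)
        workers.emplace_back([&queue] {
            while (auto item = queue.pop())
                (*item)->output = (*item)->input * 2; // placeholder for "run ABC"
        });
    // "prepare" phase: main thread only, feeding the queue as it goes.
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        items[i].input = inputs[i];
        queue.push(&items[i]);
    }
    queue.close();
    for (std::thread &worker : workers)
        worker.join();
    // "extract" phase: main thread only, after the workers are done.
    std::vector<int> results;
    for (const WorkItem &item : items)
        results.push_back(item.output);
    return results;
}
```

In this simplified version the extract phase waits for all workers to finish; the real pass can interleave main-thread work with running ABC jobs.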

To simplify things and also provide a small performance boost, the writing of lutdefs.txt and stdcells.genlib is factored out so it happens just once per pass instead of once per ABC run.

Writing thread-safe code in C++ is very scary, especially in a large existing project like Yosys not designed for multithreading. Fortunately the "run ABC" phase is not large and mostly self-contained so the risk of this PR may be acceptable.

One thread safety problem I had to tackle was logging. Yosys::log() is not thread-safe and making it thread-safe would be very invasive. The obvious approach of putting locks around everything would slow down the single-threaded case and not scale well for parallel threads, plus the desired behavior of some of the logging functions w.r.t. concurrent logging calls is not clear. So I've created a DeferredLogs class which exposes a log() function which simply captures the logs for a particular work item into a buffer. Eventually, back on the main thread, those deferred logs are printed via the standard log() function. If log timing is enabled then the timestamps are not meaningful; we can fix that by extending the logging API so callers can pass in previously captured timestamps, but I'd prefer to do that after my logging PR #5243 has been merged.
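As a rough sketch of the deferred-logging idea (not the actual class in this PR; the `flush()` method and the sink callback are assumptions made for illustration), each work item owns its own buffer, so no locking is needed:

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: captures printf-style messages for one work item.
// Each work item must own its own instance; the class itself does no locking.
class DeferredLogs {
public:
    void log(const char *format, ...) {
        assert(format != nullptr);
        va_list args;
        va_start(args, format);
        // First pass on a copy of the arguments to measure the message.
        va_list args_copy;
        va_copy(args_copy, args);
        int length = std::vsnprintf(nullptr, 0, format, args_copy);
        va_end(args_copy);
        if (length >= 0) {
            std::string message(static_cast<std::size_t>(length) + 1, '\0');
            std::vsnprintf(message.data(), message.size(), format, args);
            message.resize(static_cast<std::size_t>(length)); // drop the NUL
            messages_.push_back(std::move(message));
        }
        va_end(args);
    }

    // Called on the main thread: replays the captured messages through
    // `sink` (a stand-in for the regular logging function) and clears them.
    template <typename Sink>
    void flush(Sink &&sink) {
        for (const std::string &message : messages_)
            sink(message);
        messages_.clear();
    }

private:
    std::vector<std::string> messages_;
};
```

A worker would call `log()` while processing its item; the main thread later calls `flush()` with a callback that forwards each message to the standard logger.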

If applicable, please suggest to reviewers how they can test the change.

The existing Yosys test suite exercises this code fairly well. If we take this PR and especially if we carry on down the road of adding more parallelism, it would be good to run the Yosys test suite with TSAN regularly.

@KrystalDelusion
Member

WASI doesn't have threading support, so you'll need to add a way to downgrade to not using threads. There is already a DISABLE_ABC_THREADS make option that disables pthread when building ABC, which may still make sense here, but it may also be better to have an ENABLE_THREADS feature option which can be shared with anything else that ends up implementing threads.

@phsauter
Contributor

phsauter commented Aug 4, 2025

> To simplify things and also provide a small performance boost, the writing of lutdefs.txt and stdcells.genlib is factored out so it happens just once per pass instead of once per ABC run.

If you want to go even further: ABC always converts liberty libraries to its internal genlib representation and does not work with the full liberty data. So you could factor this part out as well, so that liberty files are also processed only once. This provides very significant time savings if you have a larger (commercial) library.

@phsauter
Contributor

phsauter commented Aug 4, 2025

I will test this on a few larger designs to see what it does to memory usage, but I actually think this should be fine. In my experience with more 'trivial' multithreading of ABC using xargs, it doesn't noticeably increase peak memory usage.

Another thing to consider is that it might be interesting to sort the extracted netlists by size and always start with the largest ones, as they will likely take the longest in ABC, which could then limit time spent in multithreaded mode if they are queued too late.
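A minimal sketch of that ordering, assuming each prepared run records some size measure. The `PreparedRun` struct and its `gate_count` field are hypothetical, not names from the PR:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one extracted netlist awaiting an ABC run.
struct PreparedRun {
    std::string name;
    std::size_t gate_count;
};

// Orders the runs so the largest netlists are queued first. The sort is
// stable, so runs of equal size keep their original relative order.
void order_largest_first(std::vector<PreparedRun> &runs) {
    auto larger = [](const PreparedRun &a, const PreparedRun &b) {
        return a.gate_count > b.gate_count;
    };
    std::stable_sort(runs.begin(), runs.end(), larger);
    assert(std::is_sorted(runs.begin(), runs.end(), larger));
}
```

Gate count is only a proxy for ABC runtime, so this is a heuristic rather than an optimal schedule.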

@rocallahan
Contributor Author

> Another thing to consider is that it might be interesting to sort the extracted netlists by size and always start with the largest ones, as they will likely take the longest in ABC, which could then limit time spent in multithreaded mode if they are queued too late.

That's a great idea. Could and probably should be done as a separate PR after this lands.

@whitequark
Member

whitequark commented Aug 4, 2025

> WASI doesn't have threading support,

WASI does: you need to build for the wasm32-wasip1-threads target instead of wasm32-wasip1. Since Yosys is single-threaded I didn't bother adding a compile-time option, but it may be worth checking whether threads are supported using #ifdef _REENTRANT.

There is a good reason to keep threading support optional: it requires more hostcalls from the runtime, and at least in the browser, it requires SharedArrayBuffer support, which means you need to have some quite annoying workarounds, and deploying the Wasm build from e.g. GitHub Pages becomes very tricky as it doesn't send the right cross-origin isolation (COOP/COEP) headers.
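A compile-time guard along those lines might look like the sketch below. The `SKETCH_HAVE_THREADS` macro and the `worker_thread_count()` helper are made up for illustration, and it assumes the toolchain defines `_REENTRANT` only when building for a threads-enabled WASI target:

```cpp
// Assumption: on WASI, _REENTRANT is defined only for the threads target.
#if defined(__wasi__) && !defined(_REENTRANT)
#define SKETCH_HAVE_THREADS 0
#else
#define SKETCH_HAVE_THREADS 1
#endif

#if SKETCH_HAVE_THREADS
#include <thread>
#endif

// Returns how many worker threads to start. Zero means "do all the work
// on the calling thread".
unsigned worker_thread_count(unsigned requested) {
#if SKETCH_HAVE_THREADS
    unsigned available = std::thread::hardware_concurrency();
    if (available == 0)
        available = 1; // the standard allows 0 when the value is unknown
    return requested < available ? requested : available;
#else
    (void)requested;
    return 0;
#endif
}
```

Callers would then treat a result of zero as the single-threaded path, so the same source builds for both WASI targets.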

@ShinyKate ShinyKate requested a review from widlarizer August 4, 2025 18:49
@rocallahan rocallahan force-pushed the abc-parallel branch 3 times, most recently from 705cf74 to 7367ef3 Compare August 5, 2025 09:45
@rocallahan rocallahan marked this pull request as draft August 5, 2025 09:45
@rocallahan
Contributor Author

There is a performance issue I need to investigate so it's not ready for review right now.

I'm not sure how to set YOSYS_ENABLE_THREADS in the Windows build. It should build OK on Windows with threads disabled.

@KrystalDelusion
Member

> I'm not sure how to set YOSYS_ENABLE_THREADS in the Windows build. It should build OK on Windows with threads disabled.

I think you need to modify the .vcxproj file, similar to how the cpp standard is overridden:

sed -i 's,</AdditionalIncludeDirectories>,</AdditionalIncludeDirectories>\n <LanguageStandard>stdcpp17</LanguageStandard>\n <AdditionalOptions>/Zc:__cplusplus %(AdditionalOptions)</AdditionalOptions>,g' "$vcxsrc"/YosysVS/YosysVS.vcxproj.new

I'm not sure exactly how you tell Visual Studio to use the pthread lib, but there is a field in there for preprocessor definitions:

<PreprocessorDefinitions>_YOSYS_;_CRT_SECURE_NO_WARNINGS;WIN32;_DEBUG;_CONSOLE;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>

It's probably fine to leave it as-is though.

…ion `extract()`

Splits up the big `abc_module()` function and isolates the code that modifies the design
after running ABC.
Currently `assign_map` is rebuilt from the module from scratch every time we invoke ABC.
That doesn't scale when we do thousands of ABC runs over large modules. Instead,
create it once and then maintain it incrementally as we update the module.
…wires in the module every time we run ABC.

This does not scale when we run ABC thousands of times in a single AbcPass.
`prepare_module()` will have to run on the main thread.
Large circuits can run hundreds or thousands of ABCs in a single AbcPass.
For some circuits, some of those ABC runs can run for hundreds of seconds.
Running ABCs in parallel with each other and in parallel with main-thread
processing (reading and writing BLIF files, copying ABC BLIF output into
the design) can give large speedups.
Doing ABC runs in parallel can actually make things slower when every ABC run requires
spawning an ABC subprocess --- especially when using popen(), which on glibc does not
use vfork(). What seems to happen is that constant fork()ing keeps making the main
process data pages copy-on-write, so the main process code that is setting up each ABC
call takes a lot of minor page-faults, slowing it down.

The solution is pretty straightforward although a little tricky to implement.
We just reuse ABC subprocesses. Instead of passing the ABC script name on the command
line, we spawn an ABC REPL and pipe a command into it to source the script. When that's
done we echo an `ABC_DONE` token instead of exiting. Yosys then puts the ABC process
onto a stack which we can pull from the next time we do an ABC run.

For one of our large designs, this is an additional 5x speedup of the primary AbcPass.
It does 5155 ABC runs, all very small; runtime of the AbcPass goes from 760s to 149s
(not very scientific benchmarking but the effect size is large).
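
The reuse scheme in the last commit message can be illustrated with the sketch below. It is not the PR's implementation: `DoneTokenScanner` and `ProcessPool` are hypothetical names, and the process handle type is left as a template parameter because spawning and piping are platform-specific.

```cpp
#include <mutex>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Accumulates output read from an ABC process and reports when the
// `ABC_DONE` token has been seen, even if it is split across reads.
class DoneTokenScanner {
public:
    // Feeds one chunk read from the pipe; returns true once the token
    // has appeared anywhere in the output so far.
    bool feed(const std::string &chunk) {
        buffer_ += chunk;
        return buffer_.find("ABC_DONE") != std::string::npos;
    }
    // Forgets earlier output before the same process runs another script.
    void reset() { buffer_.clear(); }
private:
    std::string buffer_;
};

// Stack of idle processes; the most recently used one is handed out first.
template <typename Process>
class ProcessPool {
public:
    // Returns an idle process if there is one. On nullopt the caller is
    // expected to spawn a fresh process.
    std::optional<Process> take() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (idle_.empty())
            return std::nullopt;
        Process process = std::move(idle_.back());
        idle_.pop_back();
        return std::optional<Process>(std::move(process));
    }
    // Returns a process to the pool after it has reported the done token.
    void put_back(Process process) {
        std::lock_guard<std::mutex> lock(mutex_);
        idle_.push_back(std::move(process));
    }
private:
    std::mutex mutex_;
    std::vector<Process> idle_;
};
```

A worker would `take()` a process (or spawn one), pipe in the command that sources the script, feed the process output to a scanner until it reports the token, and then `put_back()` the process.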
@rocallahan rocallahan marked this pull request as ready for review August 13, 2025 21:30
@rocallahan
Contributor Author

rocallahan commented Aug 13, 2025

I've updated the PR. Mainly I've added another commit that uses a pool of ABC processes and reuses ABC processes instead of always spawning a new one for every ABC run. This avoids some situations where doing parallel ABC runs could actually be a regression.

@phsauter
Contributor

Now that there is a thread pool, I think it makes sense to have a scratchpad (Yosys' internal config system) value to set a maximum number of threads.
It's not a must-have but a nice-to-have imo.

@rocallahan
Contributor Author

> Now that there is a thread pool, I think it makes sense to have a scratchpad (Yosys' internal config system) value to set a maximum number of threads. It's not a must-have but a nice-to-have imo.

Do you want it in this PR or a separate PR?
