-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Description
I've come across what looks like an anomalous difference in performance inverting a positive definite matrix using dpotrf() and dpotri(), on Windows as compared with Linux. This is on a dual-boot SkylakeX laptop, using OpenBLAS 0.3.28, compiled with gcc 14.2.0 on Arch Linux and cross-compiled with x86_64-mingw-w64-mingw32-gcc 14.2.0 for Windows 11, in both cases using OpenMP for threading. The configuration flags are mostly the same for the two OpenBLAS builds, except that the Windows build uses DYNAMIC_ARCH=1 but the Linux one is left to auto-detect SkylakeX.
The context is a Gibbs sampling operation with many thousands of iterations, so the performance difference becomes very striking. My test rig iterates inversion of a sequence of p.d. matrices of moderate size, from dimension 4 to 64 by powers of 2. Given the moderate size, multi-threading is not really worthwhile. Best performance is achieved by setting OMP_NUM_THREADS=1; in that case the rig runs very fast on both platforms, with Windows marginally slower than Linux. But if I set the number of OMP threads to equal the number of physical cores (4), which is the default in the program I'm working with,
- there's just a slight degradation of performance on Linux, but
- the performance on Windows becomes really horrible, 10 or more times slower than Linux.
I'd be very grateful if anyone can offer insight into what might be going on here. I'd be happy to supply more details depending on what might be relevant.