|
1 | 1 | # x86-simd-sort |
2 | 2 |
|
3 | | -C++ header file library for SIMD based 16-bit, 32-bit and 64-bit data type |
4 | | -sorting algorithms on x86 processors. Source header files are available in src |
5 | | -directory. We currently only have AVX-512 based implementation of quicksort, |
6 | | -argsort, quickselect, paritalsort and key-value sort. This repository also |
7 | | -includes a test suite which can be built and run to test the sorting algorithms |
8 | | -for correctness. It also has benchmarking code to compare its performance |
9 | | -relative to std::sort. The following API's are currently supported: |
10 | | - |
11 | | -#### Quicksort |
12 | | - |
13 | | -```cpp |
14 | | -void avx512_qsort<T>(T* arr, int64_t arrsize) |
15 | | -``` |
16 | | -Supported datatypes: `uint16_t`, `int16_t`, `_Float16`, `uint32_t`, `int32_t`, |
17 | | -`float`, `uint64_t`, `int64_t` and `double`. |
18 | | - |
19 | | -For floating-point types, if `arr` contains NaNs, they are moved to the end and |
20 | | -replaced with a quiet NaN. That is, the original, bit-exact NaNs in the input |
21 | | -are not preserved. |
22 | | - |
23 | | -#### Argsort |
| 3 | +C++ template library for high performance SIMD based sorting routines for |
| 4 | +16-bit, 32-bit and 64-bit data types. The sorting routines are accelerated |
| 5 | +using AVX-512/AVX2 when available. The library automatically picks the best |
| 6 | +version for the processor it runs on. If you are looking for the AVX-512 or |
| 7 | +AVX2 specific implementations, please see the |
| 8 | +[README](https://github.com/intel/x86-simd-sort/src/README.md) file under the |
| 9 | +`src/` directory. The following routines are currently supported: |
24 | 10 |
|
25 | 11 | ```cpp |
26 | | -std::vector<int64_t> arg = avx512_argsort<T>(T* arr, int64_t arrsize) |
27 | | -void avx512_argsort<T>(T* arr, int64_t *arg, int64_t arrsize) |
| 12 | +x86simdsort::qsort(T* arr, size_t size, bool hasnan); |
| 13 | +x86simdsort::qselect(T* arr, size_t k, size_t size, bool hasnan); |
| 14 | +x86simdsort::partial_qsort(T* arr, size_t k, size_t size, bool hasnan); |
| 15 | +std::vector<size_t> arg = x86simdsort::argsort(T* arr, size_t size, bool hasnan); |
| 16 | +std::vector<size_t> arg = x86simdsort::argselect(T* arr, size_t k, size_t size, bool hasnan); |
28 | 17 | ``` |
29 | | -Supported datatypes: `uint32_t`, `int32_t`, `float`, `uint64_t`, `int64_t` and |
30 | | -`double`. |
31 | 18 |
|
32 | | -The algorithm resorts to scalar `std::sort` if the array contains NaNs. |
| 19 | +### Build/Install |
33 | 20 |
|
34 | | -#### Quickselect |
| 21 | +[meson](https://github.com/mesonbuild/meson) is the build system used. To |
| 22 | +build and install the library, run: |
35 | 23 |
|
36 | | -```cpp |
37 | | -void avx512_qselect<T>(T* arr, int64_t arrsize) |
38 | | -void avx512_qselect<T>(T* arr, int64_t arrsize, bool hasnan) |
39 | | -``` |
40 | | -Supported datatypes: `uint16_t`, `int16_t`, `_Float16`, `uint32_t`, `int32_t`, |
41 | | -`float`, `uint64_t`, `int64_t` and `double`. |
42 | | - |
43 | | -For floating-point types, if `bool hasnan` is set, NaNs are moved to the end of |
44 | | -the array, preserving the bit-exact NaNs in the input. If NaNs are present but |
45 | | -`hasnan` is `false`, the behavior is undefined. |
46 | | - |
47 | | -#### Partialsort |
48 | | - |
49 | | -```cpp |
50 | | -void avx512_partial_qsort<T>(T* arr, int64_t arrsize) |
51 | | -void avx512_partial_qsort<T>(T* arr, int64_t arrsize, bool hasnan) |
52 | 24 | ``` |
53 | | -Supported datatypes: `uint16_t`, `int16_t`, `_Float16`, `uint32_t`, `int32_t`, |
54 | | -`float`, `uint64_t`, `int64_t` and `double`. |
55 | | - |
56 | | -For floating-point types, if `bool hasnan` is set, NaNs are moved to the end of |
57 | | -the array, preserving the bit-exact NaNs in the input. If NaNs are present but |
58 | | -`hasnan` is `false`, the behavior is undefined. |
59 | | - |
60 | | -#### Key-value sort |
61 | | -```cpp |
62 | | -void avx512_qsort_kv<T>(T* key, uint64_t* value , int64_t arrsize) |
| 25 | +meson setup --buildtype release builddir && cd builddir |
| 26 | +meson compile |
| 27 | +sudo meson install |
63 | 28 | ``` |
64 | | -Supported datatypes: `uint64_t, int64_t and double` |
65 | 29 |
|
66 | | -## Algorithm details |
| 30 | +Once installed, you can use `pkg-config --cflags --libs x86simdsortcpp` to |
| 31 | +populate the right cflags and ldflags to compile and link your C++ program. |
| 32 | +This repository also contains a test suite and benchmarking suite which are |
| 33 | +written using [googletest](https://github.com/google/googletest) and [google |
| 34 | +benchmark](https://github.com/google/benchmark) frameworks respectively. You |
| 35 | +can configure meson to build them both by using `-Dbuild_tests=true` and |
| 36 | +`-Dbuild_benchmarks=true`. |
67 | 37 |
|
68 | | -The ideas and code are based on these two research papers [1] and [2]. On a |
69 | | -high level, the idea is to vectorize quicksort partitioning using AVX-512 |
70 | | -compressstore instructions. If the array size is < 128, then use Bitonic |
71 | | -sorting network implemented on 512-bit registers. The precise network |
72 | | -definitions depend on the size of the dtype and are defined in separate files: |
73 | | -`avx512-16bit-qsort.hpp`, `avx512-32bit-qsort.hpp` and |
74 | | -`avx512-64bit-qsort.hpp`. Article [4] is a good resource for bitonic sorting |
75 | | -network. The core implementations of the vectorized qsort functions |
76 | | -`avx512_qsort<T>(T*, int64_t)` are modified versions of avx2 quicksort |
77 | | -presented in the paper [2] and source code associated with that paper [3]. |
78 | | - |
79 | | -## Example to include and build this in a C++ code |
80 | | - |
81 | | -### Sample code `main.cpp` |
| 38 | +### Example usage |
82 | 39 |
|
83 | 40 | ```cpp |
84 | | -#include "src/avx512-32bit-qsort.hpp" |
| 41 | +#include "x86simdsort.h" |
85 | 42 |
|
86 | 43 | int main() { |
87 | | - const int ARRSIZE = 1000; |
88 | | - std::vector<float> arr; |
89 | | - |
90 | | - /* Initialize elements is reverse order */ |
91 | | - for (int ii = 0; ii < ARRSIZE; ++ii) { |
92 | | - arr.push_back(ARRSIZE - ii); |
93 | | - } |
94 | | - |
95 | | - /* call avx512 quicksort */ |
96 | | - avx512_qsort(arr.data(), ARRSIZE); |
| 44 | + std::vector<float> arr(1000); |
| 45 | + x86simdsort::qsort(arr.data(), arr.size(), true); |
97 | 46 | return 0; |
98 | 47 | } |
99 | | - |
100 | | -``` |
101 | | - |
102 | | -### Build using gcc |
103 | | - |
104 | | -``` |
105 | | -g++ main.cpp -mavx512f -mavx512dq -O3 |
106 | 48 | ``` |
107 | 49 |
|
108 | | -This is a header file only library and we do not provide any compile time and |
109 | | -run time checks which is recommended while including this your source code. A |
110 | | -slightly modified version of this source code has been contributed to |
111 | | -[NumPy](https://github.com/numpy/numpy) (see this [pull |
112 | | -request](https://github.com/numpy/numpy/pull/22315) for details). This NumPy |
113 | | -pull request is a good reference for how to include and build this library with |
114 | | -your source code. |
115 | | - |
116 | | -## Build requirements |
117 | | - |
118 | | -None, its header files only. However you will need `make` or `meson` to build |
119 | | -the unit tests and benchmarking suite. You will need a relatively modern |
120 | | -compiler to build. |
121 | | - |
122 | | -``` |
123 | | -gcc >= 8.x |
124 | | -``` |
125 | | - |
126 | | -### Build using Meson |
127 | | - |
128 | | -meson is the recommended build system to build the test and benchmark suite. |
129 | | - |
130 | | -``` |
131 | | -meson setup builddir && cd builddir && ninja |
132 | | -``` |
133 | | - |
134 | | -It build two executables: |
135 | | - |
136 | | -- `testexe`: runs a bunch of tests written in ./tests directory. |
137 | | -- `benchexe`: measures performance of these algorithms for various data types. |
138 | | - |
139 | | - |
140 | | -### Build using Make |
141 | | - |
142 | | -Makefile uses `-march=sapphirerapids` as a global compile flag and hence it |
143 | | -will require g++-12. `make` command builds two executables: |
144 | | -- `testexe`: runs a bunch of tests written in ./tests directory. |
145 | | -- `benchexe`: measures performance of these algorithms for various data types |
146 | | - and compares them to std::sort. |
147 | | - |
148 | | -You can use `make test` and `make bench` to build just the `testexe` and |
149 | | -`benchexe` respectively. |
150 | | - |
151 | | -## Requirements and dependencies |
152 | | - |
153 | | -The sorting routines relies only on the C++ Standard Library and requires a |
154 | | -relatively modern compiler to build (gcc 8.x and above). Since they use the |
155 | | -AVX-512 instruction set, they can only run on processors that have AVX-512. |
156 | | -Specifically, the 32-bit and 64-bit require AVX-512F and AVX-512DQ instruction |
157 | | -set. The 16-bit sorting requires the AVX-512F, AVX-512BW and AVX-512 VMBI2 |
158 | | -instruction set. The test suite is written using the Google test framework. The |
159 | | -benchmark is written using the google benchmark framework. |
160 | | - |
161 | | -## References |
162 | | - |
163 | | -* [1] Fast and Robust Vectorized In-Place Sorting of Primitive Types |
164 | | - https://drops.dagstuhl.de/opus/volltexte/2021/13775/ |
165 | | - |
166 | | -* [2] A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512 on Intel |
167 | | -Skylake https://arxiv.org/pdf/1704.08579.pdf |
168 | | - |
169 | | -* [3] https://github.com/simd-sorting/fast-and-robust: SPDX-License-Identifier: MIT |
170 | | - |
171 | | -* [4] http://mitp-content-server.mit.edu:18180/books/content/sectbyfn?collid=books_pres_0&fn=Chapter%2027.pdf&id=8030 |
172 | 50 |
|
173 | | -* [5] https://bertdobbelaere.github.io/sorting_networks.html |
| 51 | +### Details |
| 52 | + |
| 53 | +- `x86simdsort::qsort` is equivalent to `qsort` in |
| 54 | + [C](https://www.tutorialspoint.com/c_standard_library/c_function_qsort.htm) |
| 55 | + or `std::sort` in [C++](https://en.cppreference.com/w/cpp/algorithm/sort). |
| 56 | +- `x86simdsort::qselect` is equivalent to `std::nth_element` in |
| 57 | + [C++](https://en.cppreference.com/w/cpp/algorithm/nth_element) or |
| 58 | + `np.partition` in |
| 59 | + [NumPy](https://numpy.org/doc/stable/reference/generated/numpy.partition.html). |
| 60 | +- `x86simdsort::partial_qsort` is equivalent to `std::partial_sort` in |
| 61 | + [C++](https://en.cppreference.com/w/cpp/algorithm/partial_sort). |
| 62 | +- `x86simdsort::argsort` is equivalent to `np.argsort` in |
| 63 | + [NumPy](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html). |
| 64 | +- `x86simdsort::argselect` is equivalent to `np.argpartition` in |
| 65 | + [NumPy](https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html). |
| 66 | + |
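Reviewer note on the list above: the equivalences it names can be illustrated with standard-library calls alone. The sketch below does not call x86-simd-sort at all; the helper names `scalar_argsort`, `kth_smallest` and `smallest_k_sorted` are hypothetical and exist only to show what `argsort`, `qselect` and `partial_qsort` are described as being equivalent to.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of x86-simd-sort): returns the indices that
// would sort `arr` in ascending order, in the spirit of np.argsort.
std::vector<std::size_t> scalar_argsort(const std::vector<float>& arr) {
    std::vector<std::size_t> idx(arr.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::sort(idx.begin(), idx.end(),
              [&arr](std::size_t a, std::size_t b) { return arr[a] < arr[b]; });
    return idx;
}

// Hypothetical helper: k-th smallest element (0-based) via std::nth_element,
// the operation the README compares qselect to. Works on a copy of the input.
float kth_smallest(std::vector<float> arr, std::size_t k) {
    assert(k < arr.size());
    auto kth = arr.begin() + static_cast<std::ptrdiff_t>(k);
    std::nth_element(arr.begin(), kth, arr.end());
    return *kth;
}

// Hypothetical helper: the k smallest elements in ascending order via
// std::partial_sort, the operation the README compares partial_qsort to.
std::vector<float> smallest_k_sorted(std::vector<float> arr, std::size_t k) {
    assert(k <= arr.size());
    std::partial_sort(arr.begin(),
                      arr.begin() + static_cast<std::ptrdiff_t>(k),
                      arr.end());
    arr.resize(k);
    return arr;
}
```

These scalar versions can serve as a reference when checking results from the SIMD routines on inputs without NaNs.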
| 67 | +Supported datatypes: `uint16_t, int16_t, _Float16, uint32_t, int32_t, float, |
| 68 | +uint64_t, int64_t, double`. Note that `_Float16` requires building this |
| 69 | +library with g++ >= 12.x. All the functions have an optional argument `bool |
| 70 | +hasnan`, set to `false` by default (it is relevant to floating point data |
| 71 | +types only). If your array has NaNs and `hasnan` is `false`, the behaviour of |
| 72 | +the sorting routine is undefined. If `hasnan` is set to `true`, NaNs are always |
| 73 | +sorted to the end of the array. In addition, `qsort` replaces all NaNs with |
| 74 | +`std::numeric_limits<T>::quiet_NaN()`, so the original bit-exact NaNs in the |
| 75 | +input are not preserved. Also note that the arg methods (`argsort` and |
| 76 | +`argselect`) do not use the SIMD based algorithms if they detect NaNs in the |
| 77 | +array. You can read details of all the implementations |
| 78 | +[here](https://github.com/intel/x86-simd-sort/src/README.md). |
| 79 | + |
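Reviewer note on the NaN paragraph above: the behaviour described for `qsort` with `hasnan` set to `true` (NaNs moved to the end and replaced by a quiet NaN) can be sketched in scalar code. This is an assumption-laden reference written with the standard library only, not the library's implementation, and `sort_nans_last` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical scalar reference (not part of x86-simd-sort): sorts `arr` in
// ascending order, places all NaNs at the end, and overwrites them with a
// quiet NaN, so the original NaN bit patterns are lost.
void sort_nans_last(std::vector<float>& arr) {
    // Move every non-NaN value to the front; NaNs land in [first_nan, end).
    auto first_nan = std::partition(arr.begin(), arr.end(),
                                    [](float x) { return !std::isnan(x); });
    // Sort only the NaN-free prefix, where operator< is a valid ordering.
    std::sort(arr.begin(), first_nan);
    assert(std::is_sorted(arr.begin(), first_nan));
    // Replace the trailing NaNs with the canonical quiet NaN.
    std::fill(first_nan, arr.end(), std::numeric_limits<float>::quiet_NaN());
}
```

Partitioning before sorting matters because comparisons involving NaN always return false, which breaks the strict weak ordering that `std::sort` requires.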
| 80 | +### Downstream projects using x86-simd-sort |
| 81 | + |
| 82 | +- NumPy uses this as a [submodule](https://github.com/numpy/numpy/pull/22315) to accelerate `np.sort, np.argsort, np.partition and np.argpartition`. |
| 83 | +- A slightly modified version of this library has been integrated into [openJDK](https://github.com/openjdk/jdk/pull/14227). |
| 84 | +- [GRAPE](https://github.com/alibaba/libgrape-lite.git): C++ library for parallel graph processing. |
| 85 | +- AVX-512 version of the key-value sort has been submitted to [Oceanbase](https://github.com/oceanbase/oceanbase/pull/1325). |