Add @inbounds to SparseArrays.nzrange for CuSparseDeviceMatrixCSC #2970
Your PR requires formatting changes to meet the project's style guidelines. Click here to view the suggested changes.
diff --git a/lib/cusparse/device.jl b/lib/cusparse/device.jl
index 6fa563552..dea10c5f6 100644
--- a/lib/cusparse/device.jl
+++ b/lib/cusparse/device.jl
@@ -37,7 +37,7 @@ SparseArrays.nnz(g::CuSparseDeviceMatrixCSC) = g.nnz
SparseArrays.rowvals(g::CuSparseDeviceMatrixCSC) = g.rowVal
SparseArrays.getcolptr(g::CuSparseDeviceMatrixCSC) = g.colPtr
SparseArrays.getnzval(g::CuSparseDeviceMatrixCSC) = g.nzVal
-SparseArrays.nzrange(g::CuSparseDeviceMatrixCSC, col::Integer) = @inbounds SparseArrays.getcolptr(g)[col]:(SparseArrays.getcolptr(g)[col+1]-1)
+SparseArrays.nzrange(g::CuSparseDeviceMatrixCSC, col::Integer) = @inbounds SparseArrays.getcolptr(g)[col]:(SparseArrays.getcolptr(g)[col + 1] - 1)
SparseArrays.nonzeros(g::CuSparseDeviceMatrixCSC) = g.nzVal
const CuSparseDeviceColumnView{Tv, Ti} = SubArray{Tv, 1, <:CuSparseDeviceMatrixCSC{Tv, Ti}, Tuple{Base.Slice{Base.OneTo{Int}}, Int}}
Just fyi these types are probably going away soon -- I honestly don't remember if I had
I'll cross that bridge when I need to. The GPUArrays.jl sparse device arrays look to be a drop-in replacement, but I also have some quite complex kernels in IntervalMDP.jl operating on CuSparseMatrixCSC, and therefore, I expect to run into issues when transitioning - and I most certainly don't have the time to fix that atm.
CUDA.jl Benchmarks
| Benchmark suite | Current: 48ee0d2 | Previous: 2e983fe | Ratio |
|---|---|---|---|
| latency/precompile | 56920777390 ns | 56427085830.5 ns | 1.01 |
| latency/ttfp | 8317020437.5 ns | 8362501410 ns | 0.99 |
| latency/import | 4498206426 ns | 4521778039 ns | 0.99 |
| integration/volumerhs | 9624244.5 ns | 9624952.5 ns | 1.00 |
| integration/byval/slices=1 | 147221 ns | 146870 ns | 1.00 |
| integration/byval/slices=3 | 426220 ns | 425790 ns | 1.00 |
| integration/byval/reference | 145136 ns | 144866 ns | 1.00 |
| integration/byval/slices=2 | 286528.5 ns | 286021 ns | 1.00 |
| integration/cudadevrt | 103621 ns | 103323 ns | 1.00 |
| kernel/indexing | 14182 ns | 14090 ns | 1.01 |
| kernel/indexing_checked | 14943 ns | 14977.5 ns | 1.00 |
| kernel/occupancy | 691.2333333333333 ns | 670.5886075949367 ns | 1.03 |
| kernel/launch | 2199.3333333333335 ns | 2115.8 ns | 1.04 |
| kernel/rand | 18671 ns | 16842 ns | 1.11 |
| array/reverse/1d | 19950 ns | 19633 ns | 1.02 |
| array/reverse/2dL_inplace | 66907 ns | 66698 ns | 1.00 |
| array/reverse/1dL | 70183 ns | 69881 ns | 1.00 |
| array/reverse/2d | 21758 ns | 21367 ns | 1.02 |
| array/reverse/1d_inplace | 9638 ns | 9601 ns | 1.00 |
| array/reverse/2d_inplace | 13400 ns | 13220 ns | 1.01 |
| array/reverse/2dL | 73895 ns | 73483 ns | 1.01 |
| array/reverse/1dL_inplace | 66908 ns | 66751 ns | 1.00 |
| array/copy | 20830 ns | 20712 ns | 1.01 |
| array/iteration/findall/int | 157172.5 ns | 156846 ns | 1.00 |
| array/iteration/findall/bool | 139875 ns | 139935.5 ns | 1.00 |
| array/iteration/findfirst/int | 161160 ns | 160606 ns | 1.00 |
| array/iteration/findfirst/bool | 161959 ns | 161405 ns | 1.00 |
| array/iteration/scalar | 73837 ns | 72218 ns | 1.02 |
| array/iteration/logical | 216845.5 ns | 215761.5 ns | 1.01 |
| array/iteration/findmin/1d | 50519 ns | 49669 ns | 1.02 |
| array/iteration/findmin/2d | 96561.5 ns | 96275.5 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 43137 ns | 43492 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1 | 44422 ns | 44664.5 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2 | 61610 ns | 61641 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 88803 ns | 88640 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 87915.5 ns | 87635.5 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 37370 ns | 36681 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1 | 51991 ns | 48806 ns | 1.07 |
| array/reductions/reduce/Float32/dims=2 | 59879 ns | 59459 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1L | 52434 ns | 52065 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 72183 ns | 71664 ns | 1.01 |
| array/reductions/mapreduce/Int64/1d | 43399 ns | 43256 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1 | 44879.5 ns | 44863 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 61690 ns | 61500 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 89152 ns | 88638 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2L | 88167 ns | 87897.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 37366 ns | 36277.5 ns | 1.03 |
| array/reductions/mapreduce/Float32/dims=1 | 48730 ns | 41259 ns | 1.18 |
| array/reductions/mapreduce/Float32/dims=2 | 60130 ns | 59440 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1L | 52689 ns | 52331.5 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=2L | 72524.5 ns | 71656.5 ns | 1.01 |
| array/broadcast | 20031 ns | 19817 ns | 1.01 |
| array/copyto!/gpu_to_gpu | 12986 ns | 11436 ns | 1.14 |
| array/copyto!/cpu_to_gpu | 214181 ns | 215179 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 282787 ns | 282618 ns | 1.00 |
| array/accumulate/Int64/1d | 124556 ns | 124273 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 83102 ns | 83182 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 157715 ns | 157485 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1709359 ns | 1709450 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 966166 ns | 966304 ns | 1.00 |
| array/accumulate/Float32/1d | 109001 ns | 108932 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 80062 ns | 80065 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 147542.5 ns | 146929 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1618641 ns | 1618534.5 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 698238 ns | 697506 ns | 1.00 |
| array/construct | 1282.2 ns | 1270.6 ns | 1.01 |
| array/random/randn/Float32 | 45167.5 ns | 47947 ns | 0.94 |
| array/random/randn!/Float32 | 24926 ns | 24918 ns | 1.00 |
| array/random/rand!/Int64 | 27311 ns | 27167 ns | 1.01 |
| array/random/rand!/Float32 | 8903.666666666666 ns | 8884.333333333334 ns | 1.00 |
| array/random/rand/Int64 | 29812 ns | 37695.5 ns | 0.79 |
| array/random/rand/Float32 | 13240.5 ns | 12943 ns | 1.02 |
| array/permutedims/4d | 55770.5 ns | 59797.5 ns | 0.93 |
| array/permutedims/2d | 54046 ns | 53660 ns | 1.01 |
| array/permutedims/3d | 54951 ns | 54666 ns | 1.01 |
| array/sorting/1d | 2757753 ns | 2757791.5 ns | 1.00 |
| array/sorting/by | 3344532 ns | 3344326 ns | 1.00 |
| array/sorting/2d | 1080947 ns | 1080588 ns | 1.00 |
| cuda/synchronization/stream/auto | 1044.8 ns | 1040 ns | 1.00 |
| cuda/synchronization/stream/nonblocking | 7844.4 ns | 6879.299999999999 ns | 1.14 |
| cuda/synchronization/stream/blocking | 857.6823529411764 ns | 805.0612244897959 ns | 1.07 |
| cuda/synchronization/context/auto | 1196 ns | 1175.2 ns | 1.02 |
| cuda/synchronization/context/nonblocking | 7480 ns | 7439.7 ns | 1.01 |
| cuda/synchronization/context/blocking | 932.9230769230769 ns | 896.560975609756 ns | 1.04 |
This comment was automatically generated by workflow using github-action-benchmark.
Add @inbounds to SparseArrays.nzrange(g::CuSparseDeviceMatrixCSC, col::Integer) to avoid bounds checking the colPtr accesses, which was causing heavy register spilling into local memory in my kernels. This is consistent with the new SparseArrays functionality of GPUArrays.jl.
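For readers unfamiliar with the pattern, here is a minimal CPU-side sketch of the same idea on a plain SparseMatrixCSC from the SparseArrays standard library. It is not the CUDA.jl device code: `unchecked_nzrange` and `column_sum` are hypothetical names introduced only for illustration, and the sketch assumes the caller guarantees `col` lies in `1:size(A, 2)`, since @inbounds removes the check that would otherwise catch a bad index.

```julia
using SparseArrays

# Hypothetical helper mirroring the PR's one-liner on a CPU SparseMatrixCSC.
# @inbounds skips the bounds checks on colptr, so `col` must be a valid column.
function unchecked_nzrange(A::SparseMatrixCSC, col::Integer)
    colptr = SparseArrays.getcolptr(A)
    return @inbounds colptr[col]:(colptr[col + 1] - 1)
end

# Hypothetical consumer: sum the stored entries of one column.
function column_sum(A::SparseMatrixCSC{Tv}, col::Integer) where {Tv}
    vals = nonzeros(A)
    acc = zero(Tv)
    for k in unchecked_nzrange(A, col)
        @inbounds acc += vals[k]
    end
    return acc
end

# 3x3 matrix with two stored entries in column 1, none in column 2, one in column 3.
A = sparse([1, 3, 2], [1, 1, 3], [1.0, 2.0, 3.0], 3, 3)
```

On the CPU the saving from skipping the check is usually small. In a GPU kernel the bounds-check branch and its error path can add register pressure, which is the motivation given in the description above.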