
Conversation

@Zinoex (Contributor) commented Nov 13, 2025

Add `@inbounds` to `SparseArrays.nzrange(g::CuSparseDeviceMatrixCSC, col::Integer)` to avoid bounds-checking the `colPtr` accesses (which were causing significant register spilling into local memory for me), consistent with the new sparse-array functionality in GPUArrays.jl.
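
For context, a minimal sketch (not part of the PR; the kernel and its names are hypothetical) of the kind of device-side call site this affects, using the `nzrange`/`nonzeros` accessors defined in `lib/cusparse/device.jl`:

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays

# Hypothetical kernel: each thread sums the nonzeros of one column.
# On the device, `A` is a CuSparseDeviceMatrixCSC, so the loop bounds come
# from SparseArrays.nzrange - the method this PR marks with @inbounds.
function colsum_kernel!(out, A)
    col = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if col <= size(A, 2)
        acc = zero(eltype(out))
        @inbounds for i in SparseArrays.nzrange(A, col)
            acc += SparseArrays.nonzeros(A)[i]
        end
        @inbounds out[col] = acc
    end
    return nothing
end

A = CuSparseMatrixCSC(sprand(Float32, 1_000, 1_000, 0.01))
out = CUDA.zeros(Float32, size(A, 2))
@cuda threads=256 blocks=cld(size(A, 2), 256) colsum_kernel!(out, A)
```

Without `@inbounds` inside `nzrange`, each `colPtr` lookup carries a bounds check on the device, which can add register pressure in larger kernels.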

@github-actions (bot) commented:

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (`git runic master`) to apply these changes.

Suggested changes:
```diff
diff --git a/lib/cusparse/device.jl b/lib/cusparse/device.jl
index 6fa563552..dea10c5f6 100644
--- a/lib/cusparse/device.jl
+++ b/lib/cusparse/device.jl
@@ -37,7 +37,7 @@ SparseArrays.nnz(g::CuSparseDeviceMatrixCSC) = g.nnz
 SparseArrays.rowvals(g::CuSparseDeviceMatrixCSC) = g.rowVal
 SparseArrays.getcolptr(g::CuSparseDeviceMatrixCSC) = g.colPtr
 SparseArrays.getnzval(g::CuSparseDeviceMatrixCSC) = g.nzVal
-SparseArrays.nzrange(g::CuSparseDeviceMatrixCSC, col::Integer) = @inbounds SparseArrays.getcolptr(g)[col]:(SparseArrays.getcolptr(g)[col+1]-1)
+SparseArrays.nzrange(g::CuSparseDeviceMatrixCSC, col::Integer) = @inbounds SparseArrays.getcolptr(g)[col]:(SparseArrays.getcolptr(g)[col + 1] - 1)
 SparseArrays.nonzeros(g::CuSparseDeviceMatrixCSC) = g.nzVal
 
 const CuSparseDeviceColumnView{Tv, Ti} = SubArray{Tv, 1, <:CuSparseDeviceMatrixCSC{Tv, Ti}, Tuple{Base.Slice{Base.OneTo{Int}}, Int}}
```

@kshyatt (Member) commented Nov 13, 2025

Just FYI, these types are probably going away soon -- I honestly don't remember if I had `@inbounds` on the GPUArrays.jl function 🙈

@Zinoex (Contributor, Author) commented Nov 13, 2025

I'll cross that bridge when I need to. The GPUArrays.jl sparse device arrays look to be a drop-in replacement, but I also have some quite complex kernels in IntervalMDP.jl operating on `CuSparseMatrixCSC`, so I expect to run into issues when transitioning, and I most certainly don't have the time to fix that at the moment.

@github-actions (bot) left a comment


CUDA.jl Benchmarks

| Benchmark suite | Current: 48ee0d2 | Previous: 2e983fe | Ratio |
|---|---|---|---|
| latency/precompile | 56920777390 ns | 56427085830.5 ns | 1.01 |
| latency/ttfp | 8317020437.5 ns | 8362501410 ns | 0.99 |
| latency/import | 4498206426 ns | 4521778039 ns | 0.99 |
| integration/volumerhs | 9624244.5 ns | 9624952.5 ns | 1.00 |
| integration/byval/slices=1 | 147221 ns | 146870 ns | 1.00 |
| integration/byval/slices=3 | 426220 ns | 425790 ns | 1.00 |
| integration/byval/reference | 145136 ns | 144866 ns | 1.00 |
| integration/byval/slices=2 | 286528.5 ns | 286021 ns | 1.00 |
| integration/cudadevrt | 103621 ns | 103323 ns | 1.00 |
| kernel/indexing | 14182 ns | 14090 ns | 1.01 |
| kernel/indexing_checked | 14943 ns | 14977.5 ns | 1.00 |
| kernel/occupancy | 691.2333333333333 ns | 670.5886075949367 ns | 1.03 |
| kernel/launch | 2199.3333333333335 ns | 2115.8 ns | 1.04 |
| kernel/rand | 18671 ns | 16842 ns | 1.11 |
| array/reverse/1d | 19950 ns | 19633 ns | 1.02 |
| array/reverse/2dL_inplace | 66907 ns | 66698 ns | 1.00 |
| array/reverse/1dL | 70183 ns | 69881 ns | 1.00 |
| array/reverse/2d | 21758 ns | 21367 ns | 1.02 |
| array/reverse/1d_inplace | 9638 ns | 9601 ns | 1.00 |
| array/reverse/2d_inplace | 13400 ns | 13220 ns | 1.01 |
| array/reverse/2dL | 73895 ns | 73483 ns | 1.01 |
| array/reverse/1dL_inplace | 66908 ns | 66751 ns | 1.00 |
| array/copy | 20830 ns | 20712 ns | 1.01 |
| array/iteration/findall/int | 157172.5 ns | 156846 ns | 1.00 |
| array/iteration/findall/bool | 139875 ns | 139935.5 ns | 1.00 |
| array/iteration/findfirst/int | 161160 ns | 160606 ns | 1.00 |
| array/iteration/findfirst/bool | 161959 ns | 161405 ns | 1.00 |
| array/iteration/scalar | 73837 ns | 72218 ns | 1.02 |
| array/iteration/logical | 216845.5 ns | 215761.5 ns | 1.01 |
| array/iteration/findmin/1d | 50519 ns | 49669 ns | 1.02 |
| array/iteration/findmin/2d | 96561.5 ns | 96275.5 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 43137 ns | 43492 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1 | 44422 ns | 44664.5 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2 | 61610 ns | 61641 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 88803 ns | 88640 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 87915.5 ns | 87635.5 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 37370 ns | 36681 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1 | 51991 ns | 48806 ns | 1.07 |
| array/reductions/reduce/Float32/dims=2 | 59879 ns | 59459 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1L | 52434 ns | 52065 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 72183 ns | 71664 ns | 1.01 |
| array/reductions/mapreduce/Int64/1d | 43399 ns | 43256 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1 | 44879.5 ns | 44863 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 61690 ns | 61500 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 89152 ns | 88638 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2L | 88167 ns | 87897.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 37366 ns | 36277.5 ns | 1.03 |
| array/reductions/mapreduce/Float32/dims=1 | 48730 ns | 41259 ns | 1.18 |
| array/reductions/mapreduce/Float32/dims=2 | 60130 ns | 59440 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1L | 52689 ns | 52331.5 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=2L | 72524.5 ns | 71656.5 ns | 1.01 |
| array/broadcast | 20031 ns | 19817 ns | 1.01 |
| array/copyto!/gpu_to_gpu | 12986 ns | 11436 ns | 1.14 |
| array/copyto!/cpu_to_gpu | 214181 ns | 215179 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 282787 ns | 282618 ns | 1.00 |
| array/accumulate/Int64/1d | 124556 ns | 124273 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 83102 ns | 83182 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 157715 ns | 157485 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1709359 ns | 1709450 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 966166 ns | 966304 ns | 1.00 |
| array/accumulate/Float32/1d | 109001 ns | 108932 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 80062 ns | 80065 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 147542.5 ns | 146929 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1618641 ns | 1618534.5 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 698238 ns | 697506 ns | 1.00 |
| array/construct | 1282.2 ns | 1270.6 ns | 1.01 |
| array/random/randn/Float32 | 45167.5 ns | 47947 ns | 0.94 |
| array/random/randn!/Float32 | 24926 ns | 24918 ns | 1.00 |
| array/random/rand!/Int64 | 27311 ns | 27167 ns | 1.01 |
| array/random/rand!/Float32 | 8903.666666666666 ns | 8884.333333333334 ns | 1.00 |
| array/random/rand/Int64 | 29812 ns | 37695.5 ns | 0.79 |
| array/random/rand/Float32 | 13240.5 ns | 12943 ns | 1.02 |
| array/permutedims/4d | 55770.5 ns | 59797.5 ns | 0.93 |
| array/permutedims/2d | 54046 ns | 53660 ns | 1.01 |
| array/permutedims/3d | 54951 ns | 54666 ns | 1.01 |
| array/sorting/1d | 2757753 ns | 2757791.5 ns | 1.00 |
| array/sorting/by | 3344532 ns | 3344326 ns | 1.00 |
| array/sorting/2d | 1080947 ns | 1080588 ns | 1.00 |
| cuda/synchronization/stream/auto | 1044.8 ns | 1040 ns | 1.00 |
| cuda/synchronization/stream/nonblocking | 7844.4 ns | 6879.299999999999 ns | 1.14 |
| cuda/synchronization/stream/blocking | 857.6823529411764 ns | 805.0612244897959 ns | 1.07 |
| cuda/synchronization/context/auto | 1196 ns | 1175.2 ns | 1.02 |
| cuda/synchronization/context/nonblocking | 7480 ns | 7439.7 ns | 1.01 |
| cuda/synchronization/context/blocking | 932.9230769230769 ns | 896.560975609756 ns | 1.04 |

This comment was automatically generated by workflow using github-action-benchmark.
