
Commit a3a1ef6

Remove shared_array!
It's wildly unsound, as far as I can tell:

- The use of `UnsafeCell`/`unsafe impl Send`/`unsafe impl Sync` is very dodgy.
- The use of `MaybeUninit<[T; N]>` is completely wrong. It should be `[MaybeUninit<T>; N]`, which allows for partial initialization.
- The examples don't use `write`/`assume_init` with `MaybeUninit`, which is UB.

The alternative is to use `#[address_space(shared)] static mut ...: [MaybeUninit<T>; N]` directly instead. That avoids all the unsoundness, is more ergonomic because you don't have to work with a raw pointer, and is clearer because the details aren't hidden inside a macro.

I moved the comments on `shared_array` (with some modifications) to the `address_space` proc macro, because there were some useful details in there.
1 parent b0c4c71 commit a3a1ef6
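As a quick illustration of the `MaybeUninit` point above, here is a minimal host-side Rust sketch (not part of the commit; the buffer name and length are made up) of why an array of `MaybeUninit<T>` elements supports the element-by-element initialization that shared memory needs, using `write` to fill slots and `assume_init` only after they have been written:

    use core::mem::MaybeUninit;

    fn main() {
        // Each element is its own MaybeUninit, so slots can be initialized independently;
        // this mirrors a kernel in which every thread writes only its own slot.
        let mut buf: [MaybeUninit<u32>; 8] = [const { MaybeUninit::uninit() }; 8];

        for (i, slot) in buf.iter_mut().enumerate() {
            // `write` initializes the slot without ever reading the uninitialized bytes.
            slot.write(i as u32 * 2);
        }

        // Only after every element has been written is it sound to `assume_init` them.
        let values: [u32; 8] = buf.map(|slot| unsafe { slot.assume_init() });
        assert_eq!(values[3], 6);
    }

A single `MaybeUninit<[u32; 8]>`, by contrast, can only be treated as initialized as a whole, which is why the commit calls the old wrapper type wrong for this use case.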

File tree

4 files changed (+57 / -80 lines)


Cargo.lock

Lines changed: 1 addition & 0 deletions

crates/cuda_std/CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ Notable changes to this project will be documented in this file.
 - Added warp shuffles, matches, reductions, and votes in the `warp` module.
 - Added `activemask` in the `warp` module to query a mask of the active threads.
 - Fixed `lane_id` generating invalid ptx.
+- Removed `shared_array!` due to unsoundness.
 
 ## 0.2.2 - 2/7/22

crates/cuda_std/src/shared.rs

Lines changed: 3 additions & 77 deletions
@@ -1,83 +1,9 @@
-//! Static and Dynamic shared memory handling.
+//! Dynamic shared memory handling.
+//!
+//! Static shared memory is done via `#[address_space(shared)] static mut ...;`.
 
 use crate::gpu_only;
 
-/// Statically allocates a buffer large enough for `len` elements of `array_type`,
-/// yielding a `*mut array_type` that points to uninitialized shared memory. `len` must
-/// be a constant expression.
-///
-/// Note that this allocates the memory __statically__, it expands to a static in the
-/// `shared` address space. Therefore, calling this macro multiple times in a loop will
-/// always yield the same data. However, separate invocations of the macro will yield
-/// different buffers.
-///
-/// The data is uninitialized by default, therefore, you must be careful to not read the
-/// data before it is written to. The semantics of what "uninitialized" actually means
-/// on the GPU (i.e. if it yields unknown data or if it is UB to read it whatsoever) are
-/// not well known, so even if the type is valid for any backing memory, make sure to
-/// not read uninitialized data.
-///
-/// # Safety
-///
-/// Shared memory usage is fundamentally extremely unsafe and impossible to statically
-/// prove, therefore the burden of correctness is on the user. Some of the things you
-/// must ensure in your usage of shared memory are:
-///
-/// - Shared memory is only shared across __thread blocks__, not the entire device,
-/// therefore it is unsound to try and rely on sharing data across more than one
-/// block.
-/// - You must write to the shared buffer before reading from it as the data is
-/// uninitialized by default.
-/// - [`thread::sync_threads`](crate::thread::sync_threads) must be called before
-/// relying on the results of other threads, this ensures every thread has reached
-/// that point before going on. For example, reading another thread's data after
-/// writing to the buffer.
-/// - No access may be out of bounds, this usually means making sure the amount of
-/// threads and their dimensions are correct.
-///
-/// It is suggested to run your executable in `cuda-memcheck` to make sure usages of
-/// shared memory are right.
-///
-/// # Examples
-///
-/// ```no_run
-/// # use cuda_std::kernel;
-/// # use cuda_std::shared_array;
-/// # use cuda_std::thread;
-/// ##[kernel]
-/// pub unsafe fn reverse_array(d: *mut i32, n: usize) {
-///     let s = shared_array![i32; 64];
-///     let t = thread::thread_idx_x() as usize;
-///     let tr = n - t - 1;
-///     *s.add(t) = *d.add(t);
-///     thread::sync_threads();
-///     *d.add(t) = *s.add(tr);
-/// }
-/// ```
-#[macro_export]
-macro_rules! shared_array {
-    ($array_type:ty; $len:expr) => {{
-        #[$crate::gpu_only]
-        #[inline(always)]
-        fn shared_array() -> *mut $array_type {
-            use ::core::{cell::UnsafeCell, mem::MaybeUninit};
-            struct SyncWrapper(UnsafeCell<MaybeUninit<[$array_type; $len]>>);
-            // SAFETY: it is up to the user to verify sound shared memory usage, we cannot
-            // fundamentally check it for soundness.
-            unsafe impl Send for SyncWrapper {}
-            // SAFETY: see above
-            unsafe impl Sync for SyncWrapper {}
-
-            // the initializer is discarded when declaring shared globals, so it is unimportant.
-            #[$crate::address_space(shared)]
-            static SHARED: SyncWrapper = SyncWrapper(UnsafeCell::new(MaybeUninit::uninit()));
-
-            SHARED.0.get() as *mut $array_type
-        }
-        shared_array()
-    }};
-}
-
 /// Gets a pointer to the dynamic shared memory that was allocated by the caller of the kernel. The
 /// data is left uninitialized.
 ///
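For users migrating off the removed macro, the following sketch restates the `reverse_array` example from the deleted docs above using the replacement pattern this commit recommends; it is essentially the example the commit adds to the `address_space` docs below, so the names (`S`, `d`, `n`) and the fixed length of 64 come from those examples rather than from any new API:

    use core::mem::MaybeUninit;
    use cuda_std::*;

    #[kernel]
    pub unsafe fn reverse_array(d: *mut u32, n: usize) {
        // Previously: `let s = shared_array![i32; 64];` yielded a raw pointer into
        // uninitialized shared memory. Now the shared static is declared directly.
        #[address_space(shared)]
        static mut S: [MaybeUninit<u32>; 64] = [const { MaybeUninit::uninit() }; 64];

        let i = thread::thread_idx_x() as usize;
        let ir = n - i - 1;

        // Each thread initializes exactly one slot, so no uninitialized data is read.
        unsafe { S[i].write(*d.add(i)) };
        // Every thread in the block must reach this barrier before any thread reads
        // a slot written by another thread.
        thread::sync_threads();
        unsafe { *d.add(i) = S[ir].assume_init() };
    }

The barrier, block-only sharing, and bounds caveats from the removed safety notes still apply unchanged.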

crates/cuda_std_macros/src/lib.rs

Lines changed: 52 additions & 3 deletions
@@ -205,12 +205,61 @@ pub fn externally_visible(
 }
 
 /// Notifies the codegen to put a `static`/`static mut` inside of a specific memory address space.
-/// This is mostly for internal use and/or advanced users, as the codegen and `cuda_std` handle address space placement
-/// implicitly. **Improper use of this macro could yield weird or undefined behavior**.
+/// This is mostly for internal use and/or advanced users, as the codegen and `cuda_std` handle
+/// address space placement implicitly. **Improper use of this macro could yield weird or undefined
+/// behavior**.
 ///
-/// This macro takes a single argument which can either be `global`, `shared`, `constant`, or `local`.
+/// This macro takes a single argument which can either be `global`, `shared`, `constant`, or
+/// `local`.
 ///
 /// This macro does nothing on the CPU.
+///
+/// # Shared memory
+///
+/// The item `#[address_space(shared)] static mut FOO: [MaybeUninit<T>; N];` statically allocates a
+/// buffer large enough for `N` elements of type `T`, yielding an uninitialized array in shared
+/// memory.
+///
+/// Note that this allocates the memory __statically__, i.e. it expands to a static in the `shared`
+/// address space. Therefore, calling this macro multiple times in a loop will always yield the
+/// same data. However, separate invocations of the macro will yield different buffers.
+///
+/// Because the data is uninitialized by default, the type within the array must be `MaybeUninit`,
+/// and uses must follow the usual rules of `MaybeUninit`, such as using `write`/`assume_init`.
+/// Using a non-`MaybeUninit` type is undefined behaviour.
+///
+/// # Safety
+///
+/// Shared memory usage is fundamentally unsafe and much of the burden of correctness is on the
+/// user. For example:
+/// - Shared memory is only shared across __thread blocks__, not the entire device, therefore it is
+/// unsound to rely on sharing data across more than one block.
+/// - You must write to the shared buffer before reading from it as the data is uninitialized by
+/// default.
+/// - `cuda_std::thread::sync_threads` must be called before relying on the results of other
+/// threads. This ensures every thread has reached that point before going on. For example, when
+/// reading another thread's data after writing to the buffer.
+///
+/// It is suggested to run your executable in `cuda-memcheck` to make sure usages of
+/// shared memory are right.
+///
+/// # Examples
+///
+/// ```ignore
+/// use core::mem::MaybeUninit;
+/// use cuda_std::*;
+///
+/// ##[kernel]
+/// pub unsafe fn reverse_array(d: *mut u32, n: usize) {
+///     ##[address_space(shared)]
+///     static mut S: [MaybeUninit<u32>; 64] = [const { MaybeUninit::uninit() }; 64];
+///     let i = thread::thread_idx_x() as usize;
+///     let ir = n - i - 1;
+///     unsafe { S[i].write(*d.add(i)); };
+///     thread::sync_threads();
+///     unsafe { *d.add(i) = S[ir].assume_init(); }
+/// }
+/// ```
 #[proc_macro_attribute]
 pub fn address_space(attr: proc_macro::TokenStream, item: proc_macro::TokenStream) -> TokenStream {
     let mut global = syn::parse_macro_input!(item as syn::ItemStatic);
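The new docs cover the `shared` case in detail; for the other accepted arguments the usage shape is the same attribute on a static. Below is a hypothetical, unverified sketch (the `WEIGHTS` table and `scale` kernel are invented for illustration, not taken from the crate) of placing a read-only lookup table in the `constant` address space:

    use cuda_std::*;

    // Hypothetical lookup table; per the docs above, the attribute only affects
    // placement when compiled for the GPU and does nothing on the CPU.
    #[address_space(constant)]
    static WEIGHTS: [f32; 4] = [0.1, 0.2, 0.3, 0.4];

    #[kernel]
    pub unsafe fn scale(out: *mut f32, n: usize) {
        let i = thread::thread_idx_x() as usize;
        if i < n {
            // Reading an immutable static is safe Rust; only the raw pointer write is unsafe.
            *out.add(i) = WEIGHTS[i % 4];
        }
    }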
