@@ -62,44 +62,120 @@ kernel_name<<<dim3(Blocks), dim3(Threads),0,0>>>(arg1,arg2,...);
6262</div >
6363* Minimum 256 threads per block is required for the best performance, in general more tuning (architecture dependent) is required.
6464
65+ # Device memory hierarchy
66+
67+ <div class =" column " >
68+ - Registers (per-thread-access)
69+ - Local memory (per-thread-access)
70+ - Shared memory (per-block-access)
71+ - Global memory (global access)
72+ </div >
73+
74+ <div class =" column " >
75+ ![ ] ( img/memlayout.png ) {width=80%}
76+ </div >
77+
78+
79+ # Device memory hierarchy
80+
81+ <div class =" column " >
82+ - Registers (per-thread-access)
83+ - Used automatically
84+ - Size on the order of kilobytes
85+ - Very fast access
86+ - Local memory (per-thread-access)
87+ - Used automatically if all registers are reserved
88+ - Local memory resides in global memory
89+ - Very slow access
90+ </div >
91+
92+ <div class =" column " >
93+ - Shared memory (per-block-access)
94+ - Usage must be explicitly programmed
95+ - Size on the order of kilobytes
96+ - Fast access
97+ - Global memory (per-device-access)
98+ - Managed by the host through HIP API
99+ - Size on the order of gigabytes
100+ - Very slow access
101+ </div >
102+
103+
65104# Global memory access in device code
105+ <small >
66106
67107- Global memory access from the device has high latency
68108
69109- Threads are executed in wavefronts/warps, memory operations are grouped in a similar
70110 fashion
111+
71112- Memory access is optimized for coalesced access where threads read from and write to successive memory locations
72- - Exact alignment rules and performance issues depend on the architecture
73113
74- # Coalesced memory access
114+ - Exact alignment rules and performance issues depend on the architecture
75115
76116- The global memory loads and stores consist of transactions of a certain size
117+
77118- If the threads within a wavefront access data within such a block,
78119only one global memory transaction is needed
79120
80121- Irregular access patterns result in more transactions!
122+ </small >
81123
82- # Coalesced memory access example
124+ # Coalesced & strided memory access
83125
84126<div class =" column " >
127+ <small >
128+ ```
129+ __global__ void memAccess(float *out, float *in)
130+ {
131+ int tid = blockIdx.x*blockDim.x + threadIdx.x;
132+ if(tid != 12) out[tid] = in[tid];
133+ }
134+ ```
135+ </small >
136+ ![ ] ( img/01.png ) {width=80%}
137+ </div >
138+
139+ <div class =" column " >
140+ <small >
141+ ```
142+ __global__ void memAccess(float *out, float *in)
143+ {
144+ int tid = (blockIdx.x*blockDim.x + threadIdx.x)*stride;;
145+ out[tid ] = in[tid];
146+ }
147+ ```
148+ </small >
149+ ![ ] ( img/coalesced_access_1.png ) {width=80%}
150+ </div >
151+
152+
153+
154+ # Misaligned memory access
155+
156+ <div class =" column " >
157+ <small >
85158```
86159__global__ void memAccess(float *out, float *in)
87160{
88161 int tid = blockIdx.x*blockDim.x + threadIdx.x;
89162 if(tid != 12) out[tid + 16] = in[tid + 16];
90163}
91164```
165+ </small >
92166![ ] ( img/coalesced_access_4.png ) {width=80%}
93167</div >
94168
95169<div class =" column " >
170+ <small >
96171```
97172__global__ void memAccess(float *out, float *in)
98173{
99174 int tid = blockIdx.x*blockDim.x + threadIdx.x;
100175 out[tid + 1] = in[tid + 1];
101176}
102177```
178+ </small >
103179![ ] ( img/coalesced_access_3.png ) {width=80%}
104180</div >
105181
0 commit comments