Skip to content

Commit 90ebbd1

Browse files
committed
new optimziations
1 parent 7445819 commit 90ebbd1

File tree

4 files changed

+79
-3
lines changed

4 files changed

+79
-3
lines changed

docs/06-optimisation.md

Lines changed: 79 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -62,44 +62,120 @@ kernel_name<<<dim3(Blocks), dim3(Threads),0,0>>>(arg1,arg2,...);
6262
</div>
6363
* Minimum 256 threads per block is required for the best performance, in general more tuning (architecture dependent) is required.
6464

65+
# Device memory hierarchy
66+
67+
<div class="column">
68+
- Registers (per-thread-access)
69+
- Local memory (per-thread-access)
70+
- Shared memory (per-block-access)
71+
- Global memory (global access)
72+
</div>
73+
74+
<div class="column">
75+
![](img/memlayout.png){width=80%}
76+
</div>
77+
78+
79+
# Device memory hierarchy
80+
81+
<div class="column">
82+
- Registers (per-thread-access)
83+
- Used automatically
84+
- Size on the order of kilobytes
85+
- Very fast access
86+
- Local memory (per-thread-access)
87+
- Used automatically if all registers are reserved
88+
- Local memory resides in global memory
89+
- Very slow access
90+
</div>
91+
92+
<div class="column">
93+
- Shared memory (per-block-access)
94+
- Usage must be explicitly programmed
95+
- Size on the order of kilobytes
96+
- Fast access
97+
- Global memory (per-device-access)
98+
- Managed by the host through HIP API
99+
- Size on the order of gigabytes
100+
- Very slow access
101+
</div>
102+
103+
65104
# Global memory access in device code
105+
<small>
66106

67107
- Global memory access from the device has high latency
68108

69109
- Threads are executed in wavefronts/warps, memory operations are grouped in a similar
70110
fashion
111+
71112
- Memory access is optimized for coalesced access where threads read from and write to successive memory locations
72-
- Exact alignment rules and performance issues depend on the architecture
73113

74-
# Coalesced memory access
114+
- Exact alignment rules and performance issues depend on the architecture
75115

76116
- The global memory loads and stores consist of transactions of a certain size
117+
77118
- If the threads within a wavefront access data within such a block,
78119
only one global memory transaction is needed
79120

80121
- Irregular access patterns result in more transactions!
122+
</small>
81123

82-
# Coalesced memory access example
124+
# Coalesced & strided memory access
83125

84126
<div class="column">
127+
<small>
128+
```
129+
__global__ void memAccess(float *out, float *in)
130+
{
131+
int tid = blockIdx.x*blockDim.x + threadIdx.x;
132+
if(tid != 12) out[tid] = in[tid];
133+
}
134+
```
135+
</small>
136+
![](img/01.png){width=80%}
137+
</div>
138+
139+
<div class="column">
140+
<small>
141+
```
142+
__global__ void memAccess(float *out, float *in)
143+
{
144+
int tid = (blockIdx.x*blockDim.x + threadIdx.x)*stride;;
145+
out[tid ] = in[tid];
146+
}
147+
```
148+
</small>
149+
![](img/coalesced_access_1.png){width=80%}
150+
</div>
151+
152+
153+
154+
# Misaligned memory access
155+
156+
<div class="column">
157+
<small>
85158
```
86159
__global__ void memAccess(float *out, float *in)
87160
{
88161
int tid = blockIdx.x*blockDim.x + threadIdx.x;
89162
if(tid != 12) out[tid + 16] = in[tid + 16];
90163
}
91164
```
165+
</small>
92166
![](img/coalesced_access_4.png){width=80%}
93167
</div>
94168

95169
<div class="column">
170+
<small>
96171
```
97172
__global__ void memAccess(float *out, float *in)
98173
{
99174
int tid = blockIdx.x*blockDim.x + threadIdx.x;
100175
out[tid + 1] = in[tid + 1];
101176
}
102177
```
178+
</small>
103179
![](img/coalesced_access_3.png){width=80%}
104180
</div>
105181

docs/img/01.png

30.6 KB
Loading

docs/img/04.png

37.8 KB
Loading

docs/img/coalesced_access_1.png

25.2 KB
Loading

0 commit comments

Comments
 (0)