Commit 2ceaca6

Update README.md
1 parent f4c6135 commit 2ceaca6

1 file changed

+2
-2
lines changed

optimization/01-matrix_transpose/README.md

Lines changed: 2 additions & 2 deletions
@@ -33,7 +33,7 @@ __global__ void transpose__naive_kernel(float *in, float *out, int width, int he
```
The index `in_index` increases with `threadIdx.x`, so two adjacent threads, `threadIdx.x` and `threadIdx.x+1`, access neighboring elements in global memory. This ensures coalesced reads. The writes, on the other hand, are strided: two adjacent threads write to memory locations that are `height` elements apart.
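
To make the two access patterns concrete, here is a small host-side C++ sketch that evaluates the indices for neighboring threads. The formulas `y * width + x` and `x * height + y` are assumptions inferred from the description above (the kernel body is not shown in this excerpt), and the helper names are hypothetical.

```cpp
// Host-side model of the naive kernel's index arithmetic.
// ASSUMPTION: x and y are the global column and row of a thread, and the
// kernel computes its indices as below; this is inferred from the text,
// not copied from the kernel.

// Element read by thread (x, y): consecutive x gives consecutive addresses.
constexpr int naive_in_index(int x, int y, int width) {
    return y * width + x;
}

// Element written by thread (x, y): consecutive x gives addresses
// that are `height` elements apart.
constexpr int naive_out_index(int x, int y, int height) {
    return x * height + y;
}

// Distance in elements between the reads of threads x and x + 1.
constexpr int read_stride(int x, int y, int width) {
    return naive_in_index(x + 1, y, width) - naive_in_index(x, y, width);
}

// Distance in elements between the writes of threads x and x + 1.
constexpr int write_stride(int x, int y, int height) {
    return naive_out_index(x + 1, y, height) - naive_out_index(x, y, height);
}
```

Under these assumptions, with `width = 1024` and `height = 512`, `read_stride` is 1 for any thread while `write_stride` is 512, which is the strided write pattern described above.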

-## Transpose with shared memory and bank conflicts
+## Transpose with shared memory
Shared Memory (SM) can be used to avoid the uncoalesced writes mentioned above.
```
__global__ void transpose_SM_kernel(float *in, float *out, int width,
@@ -57,7 +57,7 @@ __global__ void transpose_SM_kernel(float *in, float *out, int width,
```
The shared memory is local to each CU and has about 100 times lower latency than the global memory. While an extra synchronization is needed to ensure that the data has been saved locally, the gain from switching from uncoalesced to coalesced accesses outweighs the loss. SM can be read and written in any order as long as there are no bank conflicts. The first SM access, `tile[threadIdx.y][threadIdx.x] = in[in_index];`, is free of bank conflicts, but the second one, `out[out_index] = tile[threadIdx.x][threadIdx.y];`, is not. When bank conflicts occur, the accesses to the data are serialized. Even so, the gain from using SM is quite big.
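
The difference between the two SM accesses can be checked with a host-side model of the bank mapping. This is only a sketch: it assumes 32 banks of 4-byte words, a 32x32 `float` tile, and 32 threads with consecutive `threadIdx.x` issuing the access together. The actual numbers depend on the hardware, and the helper names are hypothetical.

```cpp
// ASSUMPTIONS: 32 banks, each one 4-byte word wide, and an unpadded
// 32x32 float tile. Consecutive floats then fall into consecutive banks.
constexpr int kNumBanks = 32;
constexpr int kTileDim = 32;

// Bank holding tile[row][col] when each row has `row_stride` floats.
constexpr int bank_of(int row, int col, int row_stride) {
    return (row * row_stride + col) % kNumBanks;
}

// Largest number of threads (tx = 0..kTileDim-1, fixed ty) that hit the
// same bank. A result of 1 means the access is free of bank conflicts.
//   transposed == false models tile[ty][tx]
//   transposed == true  models tile[tx][ty]
constexpr int conflict_degree(bool transposed, int ty, int row_stride) {
    int hits[kNumBanks] = {};
    int worst = 0;
    for (int tx = 0; tx < kTileDim; ++tx) {
        const int b = transposed ? bank_of(tx, ty, row_stride)
                                 : bank_of(ty, tx, row_stride);
        ++hits[b];
        if (hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

In this model `conflict_degree(false, ty, kTileDim)` is 1, while `conflict_degree(true, ty, kTileDim)` is 32: every thread of the transposed access lands in the same bank, so the access is serialized.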

-## Transpose with shared memory and bank conflicts
+## Transpose with shared memory and no bank conflicts
The bank conflicts in this case can be solved in a very simple way: we pad the shared matrix. Instead of `__shared__ float tile[tile_dim][tile_dim];` we use `__shared__ float tile[tile_dim][tile_dim+1];`. Effectively this shifts the data in the banks, so that the elements of one tile column no longer all map to the same bank. The row-wise access still touches consecutive addresses, so the padding is not expected to create other bank conflicts.
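
The effect of the extra column can be checked with a small host-side calculation. As before, this is a sketch under stated assumptions (32 banks of 4-byte words, `tile_dim = 32`, 32 threads with consecutive `threadIdx.x` accessing together), not a statement about any particular GPU. The function names are hypothetical.

```cpp
// ASSUMPTION: 32 banks, each one 4-byte word wide.
constexpr int kBanks = 32;

// Worst-case number of threads sharing a bank for the column-wise access
// tile[tx][ty], tx = 0..tile_dim-1, when each row holds tile_dim + pad floats.
constexpr int column_conflict_degree(int tile_dim, int pad, int ty = 0) {
    int hits[kBanks] = {};
    int worst = 0;
    for (int tx = 0; tx < tile_dim; ++tx) {
        const int b = (tx * (tile_dim + pad) + ty) % kBanks;
        ++hits[b];
        if (hits[b] > worst) worst = hits[b];
    }
    return worst;
}

// Same measure for the row-wise access tile[ty][tx].
constexpr int row_conflict_degree(int tile_dim, int pad, int ty = 0) {
    int hits[kBanks] = {};
    int worst = 0;
    for (int tx = 0; tx < tile_dim; ++tx) {
        const int b = (ty * (tile_dim + pad) + tx) % kBanks;
        ++hits[b];
        if (hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

With `pad = 0` the column-wise access has degree 32, and with `pad = 1` it drops to 1, because a row stride of 33 moves each successive row one bank further. The row-wise access stays at degree 1 in both cases under these assumptions.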
```
__global__ void transpose_SM_nobc_kernel(float *in, float *out, int width,
