# **GPU Shared Memory**

Lecture 17 April 29, 2025



Program #7

#### Reading for next time

# **To Dos**

## **A Simple Matrix Multiplication Kernel**

\_\_global\_\_\_ void MatrixMulKernel(float\* d\_M, float\* d\_N, float\* d\_P, int Width)

```
// Calculate the row index of the d_P element and d_M
int Row = blockIdx.y*blockDim.y+threadIdx.y;
// Calculate the column index of d_P and d_N
int Col = blockIdx.x*blockDim.x+threadIdx.x;
```

```
if ((Row < Width) && (Col < Width)) {
  float Pvalue = 0;
  // each thread computes one element of the block sub-matrix
  for (int k = 0; k < Width; ++k) {
     Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
  }
  d_P[Row*Width+Col] = Pvalue;
}
     @ David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ 2007-2016</pre>
```

#### 

### Tiled Multiply: Place global memory data into Shared Memory for reuse

 Break up the execution of the kernel into phases so that the data accesses in each phase is focused on one subset (tile) of M and N

bx = blockld.x tx = threadld.x



bx

tx

....

012 TILE WIDTH-1

0

Ν

| N <sub>0,0</sub> | N <sub>0,1</sub> | N <sub>0,2</sub> | N <sub>0,3</sub> |
|------------------|------------------|------------------|------------------|
| N <sub>1,0</sub> | N <sub>1,1</sub> | N <sub>1,2</sub> | N <sub>1,3</sub> |
| N <sub>2,0</sub> | N <sub>2,1</sub> | N <sub>2,2</sub> | N <sub>2,3</sub> |
| N <sub>3,0</sub> | N <sub>3,1</sub> | N <sub>3,2</sub> | N <sub>3,3</sub> |

| М <sub>о,</sub>  | M <sub>0,1</sub> | M <sub>0,2</sub> | М <sub>о,</sub><br>2 |
|------------------|------------------|------------------|----------------------|
| M <sub>1,0</sub> | M <sub>1,1</sub> | M <sub>1,2</sub> | M <sub>1,3</sub>     |
| M <sub>2,0</sub> | M <sub>2,1</sub> | M <sub>2,2</sub> | M <sub>2,3</sub>     |
| M <sub>3,</sub>  | M <sub>3,1</sub> | M <sub>3,2</sub> | M <sub>3,3</sub>     |

$$P_{0,0} = M_{0,0} * N_{0,0} + M_{0,1} * N_{1,0} + M_{0,2} * N_{2,0} + M_{0,3} * N_{3,0}$$

$$P_{0,1} = M_{0,0} * N_{0,1} + M_{0,1} * N_{1,1} + M_{0,2} * N_{2,1} + M_{0,3} * N_{3,1}$$

$$P_{1,0} = M_{1,0} * N_{0,0} + M_{1,1} * N_{1,0} + M_{1,2} * N_{2,0} + M_{1,3} * N_{3,0}$$

$$P_{1,1} = M_{1,0} * N_{0,1} + M_{1,1} * N_{1,1} + M_{1,2} * N_{2,1} + M_{1,3} * N_{3,1}$$



#### Read data into SM

SM





| P <sub>0,0</sub> | P <sub>0,1</sub> | P <sub>0,2</sub> | P <sub>0,3</sub> |
|------------------|------------------|------------------|------------------|
| P <sub>1,0</sub> | P <sub>1,1</sub> | P <sub>1,2</sub> | P <sub>1,3</sub> |
| P <sub>2,0</sub> | P <sub>2,1</sub> | P <sub>2,2</sub> | P <sub>2,3</sub> |
| P <sub>3,0</sub> | P <sub>3,1</sub> | P <sub>3,2</sub> | P <sub>3,3</sub> |

## Work for Block (0,0) Threads use shared memory data in step 0.

Use data in SM



Threads use shared memory data in step 1.

#### N<sub>0.0</sub> N<sub>0,1</sub> N<sub>0,2</sub> N<sub>0,3</sub> < N<sub>1,0</sub> • N<sub>1,3</sub> . N<sub>1,1</sub> N<sub>1,2</sub> N<sub>2,1</sub> N<sub>2,2</sub> N<sub>2,0</sub> N<sub>2,3</sub> • N<sub>3,2</sub> • ▲ N<sub>3,0</sub> → • N<sub>3,3</sub> • N<sub>3,1</sub>

| M <sub>0,0</sub>        | M <sub>0,1</sub> | M <sub>0,2</sub> | M <sub>0,3</sub> |
|-------------------------|------------------|------------------|------------------|
| <b>M</b> <sub>1,0</sub> | M <sub>1,1</sub> | M <sub>1,2</sub> | M <sub>1,3</sub> |
| M <sub>2,0</sub>        | M <sub>2,1</sub> | M <sub>2,2</sub> | M <sub>2,3</sub> |
| M <sub>3,0</sub>        | M <sub>3,1</sub> | M <sub>3,2</sub> | M <sub>3,3</sub> |



© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/2007-2016

Use data in SM



SM



| P <sub>0,0</sub> | P <sub>0,1</sub> | P <sub>0,2</sub> | P <sub>0,3</sub> |
|------------------|------------------|------------------|------------------|
| P <sub>1,0</sub> | P <sub>1,1</sub> | P <sub>1,2</sub> | P <sub>1,3</sub> |
| P <sub>2,0</sub> | P <sub>2,1</sub> | P <sub>2,2</sub> | P <sub>2,3</sub> |
| P <sub>3,0</sub> | P <sub>3,1</sub> | P <sub>3,2</sub> | P <sub>3,3</sub> |

Threads use shared memory data in step 2.



Use data in SM

Threads use shared memory data in step 1.

P<sub>0,0</sub>

SM



| М <sub>о,</sub>  | M <sub>0,1</sub> | M <sub>0,2</sub> | M <sub>0,3</sub> |
|------------------|------------------|------------------|------------------|
| M <sub>1,0</sub> | M <sub>1,1</sub> | M <sub>1,2</sub> | M <sub>1,3</sub> |
| M <sub>2,0</sub> | M <sub>2,1</sub> | M <sub>2,2</sub> | M <sub>2,3</sub> |
| M <sub>3,0</sub> | M <sub>3,1</sub> | M <sub>3,2</sub> |                  |

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ 2007-2016

SM • N<sub>2,0</sub> N<sub>2,1</sub> += M<sub>1.3</sub>\*N • M<sub>0,2</sub> ' M<sub>03</sub> • P<sub>0,2</sub> P<sub>0,3</sub> P ົດເ • M<sub>1,2</sub> ' M<sub>13</sub> ' P<sub>1,3</sub> . • P<sub>1,2</sub> , P<sub>1</sub> 10 • P<sub>2,1</sub> 1 P<sub>2,3</sub> P<sub>2,0</sub> P<sub>2,2</sub> ● P<sub>3,0</sub> . · P<sub>3,1</sub> < P<sub>3,2</sub> , ι P<sub>3,3</sub> ,

#### Use data in SM







## **Barrier Synchronization**

- An API function call in CUDA
  - \_\_syncthreads()
- All threads in the same block must reach the \_\_syncthreads() before any can move on
- Best used to coordinate tiled algorithms
  - To ensure that all elements of a tile are loaded
  - To ensure that all elements of a tile are consumed



Figure 4.11 An example execution timing of barrier synchronization.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ 2007-2016

## **Tiled Matrix Multiplication Kernel**

\_global\_\_ void MatrixMulKernel(float\* M, float\* N, float\* P, int Width)

- 2. \_\_shared\_\_ float subTileN[TILE\_WIDTH][TILE\_WIDTH];
- 3. int bx = blockIdx.x; int by = blockIdx.y;
- 4. int tx = threadIdx.x; int ty = threadIdx.y;

// Identify the row and column of the P element to work on

- 5. int Row = by \* TILE WIDTH + ty;
- 6. int Col = bx \* TILE WIDTH + tx;
- 7. float Pvalue = 0;

// Loop over the M and N tiles required to compute the P element

8. for (int m = 0; m < Width/TILE\_WIDTH; ++m) {

// Collaborative loading of M and N tiles into shared memory

- 9. subTileM[ty][tx] = M[Row\*Width + m\*TILE WIDTH+tx];
- 10. subTileN[ty][tx] = N[(m\*TILE\_WIDTH+ty)\*Width+Col];

```
11. _____syncthreads();
```

- 12. for (int k = 0;  $k < TILE_WIDTH$ ; ++k)
- 13. Pvalue += subTileM[ty][k] \* subTileN[k][tx];
- 14. \_\_\_\_\_syncthreads();
- 15. }
- 16. P[Row\*Width+Col] = Pvalue;
- }

## **Compare with Base Kernel**

```
global void MatrixMulKernel(float* M, float* N, float* P, int Width)
// Calculate the row index of the P element and M
int Row = blockIdx.y * blockDim.y + threadIdx.y;
// Calculate the column index of P and N
int Col = blockIdx.x * blockDim.x + threadIdx.x;
if ((Row < Width) && (Col < Width)) {
   float Pvalue = 0;
   // each thread computes one element of the block sub-matrix
   for (int k = 0; k < Width; ++k)
     Pvalue += M[Row*Width+k] * N[k*Width+Col];
   P[Row*Width+Col] = Pvalue;
```

## **Shared Memory and Threading**

- Each SM in Maxwell has 64KB shared memory (48KB max per block)
  - Shared memory size is implementation dependent!
  - For TILE\_WIDTH = 16, each thread block uses 2\*256\*4B = 2KB of shared memory.
    - Shared memory can potentially support up to 32 thread blocks actively executing, but only 8 blocks allowed
      - In reality, if only 1536 threads allowed on SM, only 1536/256 = 6 blocks allowed
    - This allows up to 8\*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
- Using 16×16 tiling, we reduce the accesses to the global memory by a factor of 16
  - The 150GB/s bandwidth can now support (150/4)\*16 = 600 GFLOPS!



## Global Memory (DRAM) Bandwidth

#### Ideal



#### Reality



©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2016

## **DRAM Bank Organization**



©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2016

## A very small DRAM Bank



©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2016

## **DRAM core arrays are slow.**

• Reading from a cell in the core array is a very slow process



©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2016

## **DRAM Bursting (burst size = 4 bits)**



ECE408/CS483/ECE498AL, University of Illinois, 2007-2016

# DRAM Bursting (cont.) second part of the burst



## **DRAM Bursting for our Example Bank**



©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2016

## **Multiple DRAM Banks**



Channel: memory controller w/ bus that connects set of DRAM banks to processor

Copyright © 2016 Elsevier Inc. All rights reserved.

## **Multiple DRAM Banks**



©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2016

## **DRAM Bursting for the 8×2 Bank**



## **Bank Interleaving**



Copyright © 2016 Elsevier Inc. All rights reserved.

# Placing a 2D C array into linear memory space (review)



## **A Simple Matrix Multiplication Kernel (review)**

```
global void MatrixMulKernel(float* M, float* N, float* P, int Width)
// Calculate the row index of the P element and M
int Row = blockIdx.y * blockDim.y + threadIdx.y;
// Calculate the column index of P and N
int Col = blockIdx.x * blockDim.x + threadIdx.x;
if ((Row < Width) && (Col < Width)) {
  float Pvalue = 0;
   // each thread computes one element of the block sub-matrix
  for (int k = 0; k < Width; ++k)
     Pvalue += M[Row*Width+k] * N[k*Width+Col];
   P[Row*Width+Col] = Pvalue;
```

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2016

## **Two Access Patterns**



k is loop counter in the inner product loop of the kernel code

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2016

## N accesses are coalesced.

Access direction in Kernel code



#### N[k\*Width+Col] Consecutive columns



©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, 2007-2016

### M accesses are not coalesced.



©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE 498AL, University of Illinois, 2007-2016

## Coalescing

- On load request, all accesses from a warp reduced to the smallest number of DRAM accesses
  - Perfect coalescing (all consecutive accesses) reduce number of DRAM accesses the most
  - No coalescing results in 32 different DRAM accesses