# **MIC-GPU:**



**High-Performance Computing** for Medical Imaging on Programmable Graphics **Hardware (GPUs)** 

# **CUDA Memory Optimization**

Klaus Mueller and Sungsoo Ha

**Stony Brook University Computer Science** Stony Brook, NY

# **Optimizing Memory Usage**



#### Minimizing data transfers with low bandwidth

- Minimizing host & device transfer
- Maximizing usage of shared memory
- Re-computing can sometimes be cheaper than transfer

#### Organizing memory accesses based on the optimal memory access patterns

- Important for global memory access (low bandwidth)
- Shared memory accesses are usually worth optimizing only in case they have a high degree of bank conflicts

**SPIE Medical Imaging 2016** 

MIC-GPU

### **Global Memory Coalescing**



#### Warps & global memory

- Threads execute by warp (32)



Block

- Block width must be divisible by 16 for coalescing
- Check your hardware (Compute Capability 1.x)
- Greatly improves throughput (Can yield speedups of >10)

egual to 32, 64, or 128 bytes and aligned to these sizes.

# **Global Memory Coalescing**



#### Compute Capability 1.0 or 1.1

Aligned 64 or 128 bytes segment Half-warp



- Sequential warp
- Divergent warp
- See some good patterns in CUDA document and CUDA SDK samples

#### Compute Capability 1.2 or higher

- 32, 64 or 128 bytes segment
- Any pattern as long as inside segment

SPIE Medical Imaging 2016 MIC-GPU **SPIE Medical Imaging 2016** MIC-GPU

### Sobel Filter Effect



Before:

After:





MIC-GPU

ml mml mr

Output to another global memory

# Example: Sobel Filter

Medical Imagino

13

Discrete convolution with Sobel mask



**SPIE Medical Imaging 2016** 

one output pixel

MIC-GPU

Output

image

# R/W Global Memory

Medical Imaging

### Bad access pattern

- Global memory only. No texture memory or shared memory of clock cycles, compared to 1 or 2 for reading from shared memory
- Unstructured read mr
- No cache, upito 1/2 global memory reads per t Input from global memory

\_\_global\_\_ void

SobelBadKernel(unsigned char\* Input, unsigned char\* output, unsigned int width, unsigned int height)

..../calculate the index for ur, ul, um, ml, mr, ll, lm, lr.

float Horz=Input[ur] +Input[lr] +2.0\*Input[mr] -2.0\*Input[ml] -Input[ul] -Input[ll];

float Vert=Input[ur] +Input[ul] +2.0\*Input[um] -2.0\*Input[lm] -Input[ll] -Input[ll] ;

output[resultindex] = abs(Horz)+abs(Vert);

SPIE Medical Imaging 2016

MIC-GPU

# Reduce Global Memory Read

Medical Imagino

```
__device__ unsigned char ComputeSobel(
       unsigned char ul,
       unsigned char um
       unsigned char ur,
       unsigned char ml,
       unsigned char mm, //not used
       unsigned char mr,
       unsigned char II,
                                        Reduce 12 reads into 8 reads
       unsigned char Im,
       unsigned char Ir,
       float fScale ){
 short Horz = ur + 2*mr + lr - ul - 2*ml - ll:
 short Vert = ul + 2*um + ur - II - 2*lm - Ir:
 short Sum = (short) (fScale*(abs(Horz)+abs(Vert)));
 if (Sum < 0) return 0; else if (Sum > 255) return 255;
 return (unsigned char) Sum;}
```

### Reading Texture Memory



### Take advantage of CUDA (texture memory)

- Using cache (texture memory) to enhance performance
- Each kernel can compute more than one pixels. This can help to exploit locality for cache
- Texture memory itself is optimized for coalescing

**SPIE Medical Imaging 2016** 

MIC-GPU

۵

# **Reading Texture Memory**



- Texture memory only.
- No shared memory

```
Global memory as output.
Need consider coalescing
when write back
```

```
unsigned char *pSobel = (unsigned char *) (((char *) pSobelOrio when write back for ( int i = threadIdx.x; i < w; i += blockDim.x ) {
    unsigned char pix00 = tex2D( tex, (float) i-1, (float) blockIdx.x-1 );
    unsigned char pix01 = tex2D( tex, (float) i+0, (float) blockIdx.x-1 );
    unsigned char pix02 = tex2D( tex, (float) i+1, (float) blockIdx.x-1 );
    unsigned char pix10 = tex2D( tex, (float) i-1, (float) blockIdx.x+0 );
    unsigned char pix11 = tex2D( tex, (float) i+0, (float) blockIdx.x+0 );
    unsigned char pix12 = tex2D( tex, (float) i+1, (float) blockIdx.x+1 );
    unsigned char pix20 = tex2D( tex, (float) i-1, (float) blockIdx.x+1 );
    unsigned char pix22 = tex2D( tex, (float) i+1, (float) blockIdx.x+1 );
    pSobel[i] = ComputeSobel(pix00, pix01, pix02, pix10, pix11, pix12, pix20, pix21, pix22, fScale );}
```

**SPIE Medical Imaging 2016** 

MIC-GPU

10

# Improve Caching?



#### Disadvantage

- Only using hardware cache to handle spatial locality
- A pixel may be still loaded 9 times in total due to cache miss

### Take advantage of CUDA Shared Memory

- Shared memory can be as fast as register! As a user-controlled cache.
- Together with texture memory, load a block of the image into shared memory
- 2. Each thread compute a consecutive rows of pixels (sliding window)
- 3. Writing result to global memory □

# Returning Example : Sobel Filter





Applying vertical and horizontal masks



Each thread will compute a number of consecutive rows of pixel

Computing all pixels inside one block (without apron)

SPIE Medical Imaging 2016 MIC-GPU 11 SPIE Medical Imaging 2016 MIC-GPU 12

### Reading Shared Memory

SPIE Medical Imaging

Shared memory + texture memory.

\_\_shared\_\_ unsigned char shared[]; kernel<<<br/>blocks, threads, sharedMem>>>(...);

| 2 | 5 | 6 | 2 | 5 | 6 |
|---|---|---|---|---|---|
| 4 | 2 | 7 | 4 | 2 | 7 |
| 5 | 2 | 6 | 5 | 2 | 6 |

**SPIE Medical Imaging 2016** 

MIC-GPU

13

# **Shared Memory Bank Conflicts**



#### Shared memory banks

Shared memory is divided into 32 banks to reduce conflicts

Shared memory



Each thread can access 32-bit from different banks simultaneously to achieve high memory bandwidth

Half-warp



 Conflict-free shared memory as fast as registers

Shared memory



**SPIE Medical Imaging 2016** 

MIC-GPU

14

16

# **Shared Memory Bank Conflicts**

SPIE Medical Imaging

### **Compute Capability 1.x**

4-way bank conflicts

\_\_shared\_\_ char shared[32]; char data = shared[BaseIndex + tid];

No bank conflicts

char data = shared[BaseIndex + 4 \* tid];

Half-warp

Shared memory

### Compute Capability 2.x

 Bank conflicts occur when multiple threads access different words in the same bank

# **Shared Memory Broadcasting**



Shared memory read a 32-bit word and broadcast to several threads simultaneously

- Read
- Reduce or resolve bank conflicts if set to broadcasting
- Which word is selected as the broadcast word and which address is picked up for each bank at each cycle is unspecified



SPIE Medical Imaging 2016 MIC-GPU

**SPIE Medical Imaging 2016** 

MIC-GPU

15

### Returning Example: Sum of Numbers



#### Add up a large set of numbers

Normalization factor:

$$S = \sum_{i=0}^{n-1} v[i]$$

Mean square error:

$$MSE = \sum_{i=0}^{n-1} (a[i] - b[i])^{2}$$

Number of addition operations:



**SPIE Medical Imaging 2016** 

MIC-GPU

17

# Non-parallel approach



Input numbers: 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2

■Non-parallel approach:

Generate only one thread 41 | 1 | 8 | -1 | 0 | -2 | 3 | 5 | -2 | -3 | 2 | 7 | 0 | 11 | 0 | 2 |

O(n) additions

Rule 1: Maximized parallel execution

**SPIE Medical Imaging 2016** 

MIC-GPU

18

# Parallel Approach: Kernel 1



☐Interleaved addressing: Kernel 1



# Parallel Approach: Kernel 1



#### CUDA code:

```
global void reduce0(int *g idata, int *g odata) {
extern shared int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadldx.x;
unsigned int i = blockldx.x*blockDim.x + threadldx.x;
sdata[tid] = g_idata[i];
_syncthreads(); Rule 3: Maximize instruction throughput
// do reduction in shared mem
for(unsigned int s=1; s < blockDim.x; s *= 2) {
  if (tid % (2*s) == 0) {
                                        inefficient statement, % operator
    sdata[tid] += sdata[tid + s];
                                                     is very slow
    syncthreads();
// write result for this block to global mem
if (tid == 0) g_odata[blockldx.x] = sdata[0];
                                           Writing with global memory
                                                     coalescing
                         MIC-GPU
```

**SPIE Medical Imaging 2016** 

Rule 2 : Optimize memory usage

# Parallel Approach: Kernel 2



#### Refinement strategy:

Just replace divergent branch in inner loop:

```
for (unsigned int s=1; s < blockDim.x; s *= 2) {
  if (tid % (2*s) == 0) {
     sdata[tid] += sdata[tid + s];
    syncthreads();
```

#### With strided index and non-divergent branch:

```
for (unsigned int s=1: s < blockDim.x: s *= 2) {
  int index = 2 * s * tid;
  if (index < blockDim.x) {
     sdata[index] += sdata[index + s];
     syncthreads():
```

**SPIE Medical Imaging 2016** 

**SPIE Medical Imaging 2016** 

MIC-GPU

MIC-GPU

21

# Parallel Approach: Kernel 2



□Interleaved addressing: Kernel 2



### Shared memory conflict!

Law 2 : Optimize memory usage

MIC-GPU **SPIE Medical Imaging 2016** 

# Parallel Approach: Kernel 3







Parallel Approach: Kernel 3

SPIE Medical Imaging

22

#### CUDA code:

Just replace strided indexing in inner loop:



With reversed loop and threadID-based indexing:

```
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
  if (tid < s) {
     sdata[tid] += sdata[tid + s];
                                               Half of the threads are idle in
                                                   first loop iteration!
    syncthreads();
                                                          Wasteful!
```

Law 3: Maximize instruction throughput

23 **SPIE Medical Imaging 2016 MIC-GPU** 24

# **Toward Final Optimized Kernel**



#### Performance for 4M numbers:

|                                                                 | Time (222 ints) | Bandwidth  | Step<br>Speedup | Cumulative<br>Speedup |
|-----------------------------------------------------------------|-----------------|------------|-----------------|-----------------------|
| Kernel 1:<br>Interleaved addressing<br>with divergent branching | 8.054 ms        | 2.083 GB/s |                 |                       |
| Kernel 2:<br>interleaved addressing<br>with bank conflicts      | 3.456 ms        | 4.854 GB/s | 2.33x           | 2.33x                 |
| Kernel 3:<br>sequential addressing                              | 1.722 ms        | 9.741 GB/s | 2.01x           | 4.68x                 |

SPIE Medical Imaging 2016 MIC-GPU 25

# **Further Optimizations**



#### Kernel 4 Rule 3

Rule 3: Maximize instruction throughput

Halve the number of blocks, with two loads

#### Kernel 5

Unrolling last loop

#### Kernel 6

Completely unrolling loops

#### Kernel 7

- · Multiple element per thread
- See details changes in

M. Harris, Optimizing parallel reduction with CUDA

**SPIE Medical Imaging 2016** 

MIC-GPU

26

# **Towards Final Optimized Kernel**



#### Performance for 4M numbers:

|                                                                 | Time (222 ints) | Bandwidth   | Step<br>Speedup | Cumulative<br>Speedup |
|-----------------------------------------------------------------|-----------------|-------------|-----------------|-----------------------|
| Kernel 1:<br>Interleaved addressing<br>with divergent branching | 8.054 ms        | 2.083 GB/s  |                 |                       |
| Kernel 2:<br>Interleaved addressing<br>with bank conflicts      | 3.456 ms        | 4.854 GB/s  | 2.33x           | 2.33x                 |
| Kernel 3:<br>sequential addressing                              | 1.722 ms        | 9.741 GB/s  | 2.01x           | 4.68x                 |
| Kernel 4:<br>first add during global load                       | 0.965 ms        | 17.377 GB/s | 1.78x           | 8.34x                 |
| Kernel 5:<br>unroll last warp                                   | 0.536 ms        | 31.289 GB/s | 1.8x            | 15.01x                |
| Kernel 6:<br>completely unrolled                                | 0.381 ms        | 43.996 GB/s | 1.41x           | 21.16x                |

#### Final optimized kernel:

| Kernel 7:                    | 0.268 me   | 62.671 GB/s  | 1.42x | 30.04x |
|------------------------------|------------|--------------|-------|--------|
| multiple elements per thread | 0.200 1113 | 02.07 1 GD/5 | 1.424 | 30.04  |

# **Best Programming Practices**



#### Three basic strategies

- Maximize parallel execution
- Optimize memory usage → achieve maximum memory bandwidth
- Optimize instruction bandwidth → maximize instruction throughput

#### Maximized parallel execution

- Minimize number of synchronization barriers → let it flow
- Minimize divergent flows
- Better synchronize within a block than across → group threads

#### Optimize memory usage

- Map poor coalescing patterns in global memory to shared memory
- Load data in coalesced chunks before computation begins
  - → #1: Get good coalescing in global memory
  - → #2: Get conflict-free data access in shared memory

SPIE Medical Imaging 2016 MIC-GPU 27 SPIE Medical Imaging 2016 MIC-GPU 28

# **Best Programming Practices**



Maximize instruction throughput

- 1. Instruction level
  - →operator, branches and loops
- 2. Optimize execution configuration

Kernel will fail to launch if

- Number of threads per block >> max number of threads per block
- Requires too many registers or shared memory than available

# **Best Programming Practices**



#### **Block**

• At least as many blocks as multiprocessors (SMs)

#### Threads

**SPIE Medical Imaging 2016** 

- Chose number of threads/block as a multiple of the warp size
- Typically 192 or 256 threads per block
- But watch out for required registers and shared memory
- Check Visual profiler or Occupancy Calculator

#### Multiprocessor occupancy

- Ratio of number of active warps per SM over max number of warps
- Visual profiler or Occupancy Calculator
  - Choose thread block size based on shared memory and registers

**SPIE Medical Imaging 2016** 

MIC-GPU

29

MIC-GPU

30