# AMDA

## OPENCL GPU BEST PRACTICES

BENJAMIN COQUELLE MAY 2015

### 

#### ▲ Data transfer

#### Parallelism

- Coalesced memory access
- Best work group size
- Occupancy
- branching

▲ All the performance numbers come from a W8100 running on a 14.502.1019 driver

### DATA TRANSFER

- To transfer data to the GPU, data need to be page locked. This operation is called pinning and it is costly (CPU time).
  - Therefore, by default, each time you call a data transfer function, we need to pin the host buffer :
    - clEnqueueWriteBuffer(queue, devicebuffer, ..., hostbuffer, ...)
    - clEnqueueReadBuffer(queue, devicebuffer, ..., hostbuffer, ...)
- OpenCL provides a mechanism to "pre-pinned" a buffer and thus achieve the best transfer rate on the PCIE bus
- 1. pinnedBuffer = clCreateBuffer( CL\_MEM\_ALLOC\_HOST\_PTR or CL\_MEM\_USE\_HOST\_PTR )
- 2. deviceBuffer = clCreateBuffer()
- 3. void \*pinnedMemory = clEnqueueMapBuffer( pinnedBuffer ) //pinning cost is incurred here
- 4. clEnqueueRead/WriteBuffer( deviceBuffer, pinnedMemory )
- 5. clEnqueueUnmapMemObject( pinnedBuffer, pinnedMemory )
- Typically an application will perform step 1, 2, 3 and 5 once. While the mapped pinned buffer can be uploaded several times from the CPU and thus different data can be uploaded while repeating step 4

### DATA TRANSFER PERFORMANCE RESULT

Write operation

| method     | 0.5MB   | 1MB     | 10MB    | 100MB    |
|------------|---------|---------|---------|----------|
| Pre-pinned | 5GB/s   | 7.5GB/s | 12GB/s  | 12.5GB/s |
| classic    | 2.1GB/s | 3GB/s   | 6.3GB/s | 6.8GB/s  |

#### Read operation

| method     | 0.5MB   | 1MB     | 10MB     | 100MB   |
|------------|---------|---------|----------|---------|
| Pre-pinned | 5.2GB/s | 7.4GB/s | 10.9GB/s | 12.GB/s |
| classic    | 2.GB/s  | 3GB/s   | 6.GB/s   | 6.5GB/s |

- Pre-pinned path is supported for the following calls
  - clEnqueueRead/WriteBuffer
  - clEnqueueRead/WriteImage
  - clEnqueueRead/WriteBufferRect

### CL image calls must use pre-pinned mapped buffers on the host side

### DATA TRANSFER PARALLELISM



### DATA TRANSFER PARALLELISM

#### Modern GPUs have 2 DMAs engines

- Can do read/write operation in parallel
- Can also overlap compute and transfer

#### To achieve parallel compute and transfer in OpenCL, one need to use multiple queues

QueueRead = clCreateCommandQueue() QueueWrite = clCreateCommandQueue() QueueCompute1 = clCreateCommandQueue(); clEnqueueReadBuffer(QueueRead) clEnqueueWriteBuffer(QueueWrite) clEnqueueNDRangeKernel(QueueCompute1)

clFlush(QueueRead);....

On our OpenCL runtime, odd queue number are allocated to DMA1, even queue number to DMA2. Be careful about the order your create your queues

AMDL

| Application Timeline Trace              |             |              |                 |                |          |
|-----------------------------------------|-------------|--------------|-----------------|----------------|----------|
| Milliseconds                            | 7198.401 72 | 215.619 7229 | .293<br>232.837 | 7250.056       | 7267.274 |
| □ Context 0 (0x00000000435DF60)         |             |              |                 |                |          |
| □ Queue 0 - Spectre (0x0000000038E3860) |             |              |                 |                |          |
| Data Transfer                           | <b>1</b>    |              | <b>1</b>        |                |          |
| Kernel Execution                        |             |              |                 |                |          |
| □ Queue 1 - Spectre (0x000000006A1CF10) |             |              |                 |                |          |
| Data Transfer                           |             |              |                 |                |          |
| Kernel Execution                        | Comp        | oute_Kernel  | (               | Compute_Kernel |          |

### DATA TRANSFER

<u>http://developer.amd.com/wordpress/media/2013/07/AMD\_Accelerated\_Parallel\_Processing\_OpenCL\_Programming\_Guide-rev-2.7.pdf</u>, chapter 5.6.2, page 89

https://github.com/AMD-FirePro?tab=repositories

#### OpenCL performance comes from parallelism

- clEnqueueNDRangeKernel (queue, kernel, dim, NULL, globalsize ,...)
- You want to have the biggest global size as possible to spawn as many threads as possible on a massively parallel device
- Hawaii (W9100/S9150) is composed of 44 CUs, each CU has 4 16-length SIMD => 2816 threads
- This GPU can actually have 112640 active threads running at the same time.



- ▲ A "hardware thread" (a wavefront) is composed of 64 threads
- ▲ A wavefront runs on one SIMD inside a CU => a wavefront executes in 4 steps
- Each SIMD can have 10 active wavefronts
- This means we can have 44\*4\*10\*64 = 112640 active threads

| work-group size                              | Number of VGPRS | Number of SGPRS | 2           | LDS SIZE     |  |  |
|----------------------------------------------|-----------------|-----------------|-------------|--------------|--|--|
| Variable                                     |                 |                 | Value       | Device Limit |  |  |
| Device Info                                  |                 |                 |             |              |  |  |
| Device name                                  |                 | ŀ               | Hawaii      |              |  |  |
| Number of compute units                      |                 | 4               | 40          |              |  |  |
| Max number of waves per compute unit         |                 | 4               | 40          |              |  |  |
| Max number of work-groups per compute unit   |                 | 1               | 16          |              |  |  |
| Wavefront size                               |                 | 6               | 64          |              |  |  |
| Kernel Info                                  |                 |                 |             |              |  |  |
| Kernel name                                  |                 | i               | init_kernel |              |  |  |
| Vector GPR usage per work-item               |                 | 1               | 10          | 256          |  |  |
| Scalar GPR usage per work-item               |                 | 1               | 12          | 104          |  |  |
| LDS usage per work-group                     |                 | (               | 0           | 65536        |  |  |
| Flattened work-group size                    |                 | 2               | 256         | 256          |  |  |
| Flattened global work size                   |                 | 2               | 256         | 16777216     |  |  |
| Number of waves per work-group               |                 | 4               | 4           | 4            |  |  |
| Kernel Occupancy                             |                 |                 |             |              |  |  |
| Number of waves limited by Vector GPR and We | ork-group size  | 4               | 40          | 40           |  |  |
| Number of waves limited by Scalar GPR and Wo | rk-group size   | 4               | 40          | 40           |  |  |
| Number of waves limited by LDS and Work-grou | p size          | 4               | 40          | 40           |  |  |
| Number of waves limited by Work-group size   |                 | 4               | 40          | 40           |  |  |
| Limiting factor(s)                           |                 | 1               | None        |              |  |  |
| Estimated occupancy                          |                 | 1               | 100%        |              |  |  |



### PARALLELISM BINARY SEARCH

- This is an example from the SDK used to show a CL2.0 feature
- Though I don't think it always exposes the fastest way of doing a search in an array
- In this example, we do a N-search where N = 256 => at each steps we have M/256 threads running on the GPU. Where M is the array size. This is not always enough to fill the GPU
- By having each thread looking into a different entry in the array (one thread per entry) we can increase the parallelism and actually write a simpler kernel when the array in not too big

| Performance | 4096 | 262144 | 4194304 | 16777216 |  |
|-------------|------|--------|---------|----------|--|
| M-search    | 0.01 | 0.03   | 0.3     | 2.23     |  |
| 256-search  | 0.08 | 0.08   | 0.3     | 1.1      |  |

| NB threads | 4096 | 262144 | 4194304 | 16777216 |
|------------|------|--------|---------|----------|
| M-search   | 4096 | 262144 | 4194304 | 16777216 |
| 256-search | 256  | 1024   | 16384   | 65536    |

- ✓ What is important to notice is we need parallelism to use the GPU.
- ▲ If we don't have a big array, the algorithm exposing the more parallelism will give the best performance.
- Though a brut force approach may not be useful when the array is really big and a N-search can be used to reduce the size before reapplying our brute force algorithm.

### PARALLELISM MULTI-TASKING

-----

#### Sometimes you have a lot of small independent batches to process

- Big linear systems break down in small pieces
- Several small meshes to animate

#### Our GPUs have 8 ACEs, Asynchronous Compute Engines.

- ACEs are responsible for compute shader scheduling
- ACEs are independent
- ACEs dispatch tasks to the compute engines as resources permit
- ACEs are independent virtual engine, enabling true Multiprocessor operation.

AMDL

▲ In OpenCL you can access them by using multiple OpenCL command queues!!

### PARALLELISM GPU ENGINES EXPOSED TO OS, PROCESS EXPLORER

| elect engines to | use for GPU us | age       |           |           |           |           |           |
|------------------|----------------|-----------|-----------|-----------|-----------|-----------|-----------|
| Engine 0         | Engine 1       | Engine 2  | Engine 3  | Engine 4  | Engine 5  | Engine 6  | Engine 7  |
|                  |                |           |           |           |           |           |           |
| ] Engine 8       | Engine 9       | Engine 10 | Engine 11 | Engine 12 | Engine 13 | Engine 14 | Engine 15 |

### MEMORY ACCESS

### Coalesced memory access

- Means adjacent thread access adjacent memory
- Our cache line is 64 bytes
  - any fetch request will actually fetch 64 bytes, even if you only look into a single char
- Data/algorithm needs to be arranged to maximize the bandwidth usage
  - For a simple vector addition a simple linear access is enough to have coalesced access

| T1   | T2   | Т3   | Т4   | T5   | Т6   | T7   | Т8   | Т9   | T10  |  |
|------|------|------|------|------|------|------|------|------|------|--|
| A[0] | A[1] | A[2] | A[3] | A[4] | A[5] | A[6] | A[7] | A[8] | A[9] |  |
| B[0] | B[1] | B[2] | B[3] | B[4] | B[5] | B[6] | B[7] | B[8] | B[9] |  |

AMDL

Here each memory fetch will be in the same cache line

- &A[0] =0xNNNN, &A[1]=0xNNNN + 4, ..., &A[15] = 0xNNNN + 60, &A[16] = 0xNNNN + 64
- We maximize the bandwidth usage

### **MEMORY ACCESS**

What happened if we don't have coalesced access.

For example we have a stride between each relevant data we need to look at

| T1   | T2      | Т3      | T4      | T5      | Т6      | T7      | Т8      | Т9      | T10     | •••• |
|------|---------|---------|---------|---------|---------|---------|---------|---------|---------|------|
| A[0] | A[1+16] | A[2+16] | A[3+16] | A[4+16] | A[5+16] | A[6+16] | A[7+16] | A[8+16] | A[9+16] |      |
| B[0] | B[1+16] | B[2+16] | B[3+16] | B[4+16] | B[5+16] | B[6+16] | B[7+16] | B[8+16] | B[9+16] |      |

AMDL

- Here each fetch will use one cache line
  - &A[0]=0xNNNN, &A[17] =0xNNNN + 68...
- ▲ For the first 10 threads we will fetch 2\*64\*10 = 1280 bytes....
- ....while only 80 bytes are useful, we waste nearly 1kB of data and we only use 1/16 of the bandwidth
- Here we have a simple test case. But this needs to be taken into account when working on more complex data. This is why SoA needs to be preferred over AoS.
- Coherency and locality of your data are key to achieve the best performance

### MEMORY ACCESS GLOBAL\_MEMORY\_BANDWIDTH

| Read linear | Read Linear | Read Single cache | Read random | Read uncombine |
|-------------|-------------|-------------------|-------------|----------------|
| uncached    | cached      | line              |             | uncached       |
| 325 GB/s    | 1473 GB/s   | 3825GB/s          | 54GB/s      | 182GB/s        |

#### //read linear cached

```
val = val + input[gid + 0];
val = val + input[gid + 1]; // this is in l1 cached as requested from previous fetch
val = val + input[gid + 2]; // this is in l1 cached...
```

```
output[gid] = val;
```

#### //read linear uncombined uncached

#define NUM\_READS 32
val = val + input[gid \* NUM\_READS + 0];
val = val + input[gid \* NUM\_READS + 1];
val = val + input[gid \* NUM\_READS + 2];...

### **BEST WORK GROUP SIZE**

- It is important to think about the work group size as it is how you will map the different work-items on the hardware
- On AMD HW, a wavefront is 64 threads => the most efficient work group sizes have to be multiple of 64
   Using 65 threads in a work group will require two wavefronts to execute and waste 63 lanes.
- This value can easily be queried in OpenCL using this API
  - clGetKernelWorkGroupInfo (CL\_KERNEL\_PREFERRED\_WORK\_GROUP\_SIZE\_MULTIPLE)
  - It is available since OpenCL 1.1 and can help you writing more generic code to support different HW vendor

You can also specify the work group size at compile time using \_\_atribute\_\_ in your OpenCL C code

- \_\_attribute\_\_((reqd\_work\_group\_size(8,8,1))) \_\_kernel void ... This will help the compiler and produce a code specific to a work group size of 8x8.
- This can help for some optimizations.
- For example, in such case our compiler won't generate barrier instruction for barrier(); but just a fence

- This is the capacity to keep the GPU busy by being able to have several wavefronts running on the same SIMD
- One SIMD can have up to 10 active wavefronts
- Switching between wavefronts allows to avoid waiting for memory transaction. A fetch/write can take 100s of clock cycle to execute while an add instruction on float takes one clock.

### OCCUPANCY VGPRS



#### We have 64 KB of registers (VGPRs) per SIMD. This is 32 bits registers

- This means 16384 VGRPs.
- A SIMD runs a wavefront of 64 threads => 256 VGPRs per thread maximum
- If a kernel uses more than 256 VGPRs, we start spilling which will affect greatly the performance as the spilling occurs in global memory
- ▲ If we use less than 256 VGPRs we can actually have several wavefront running on the same SIMD
  - With 128 VGPRs we can have 2 waves
  - With 25 we can have 10 waves

#### OCCUPANCY SPILLING

#### There are two main reasons to spill

### ▲ A big kernel

- The bigger the kernel is, the more registers you are likely to use
- In that case the compiler will try to spill to improve the occupancy. But sometime, it is so big that we spill and have a low occupancy (ie megakernel for raytracer : 15k-22k lines for a single kernel)
- The solutions are to "split" the kernel into smaller ones and/or change the algorithm
- Forcing an unroll, this will actually behave like a big kernel
  - Avoid unrolling if you find you use too many registers
- One can easily find the VGPRs usage.
  - codeXL, our profiling tool, shows the GPR usage and the spilling (scratch reg).
  - These information can be directly find in the isa code
  - You can access the isa code with codeXL or by dumping it
  - AMD\_OCL\_BUILD\_OPTIONS\_APPEND=-save-temps

| 0/                  |   |                                                   |
|---------------------|---|---------------------------------------------------|
| extUserElementCount | = | 0;                                                |
| NumVgprs            | = | 11;                                               |
| NumSgprs            | = | 20;                                               |
| FloatMode           | = | 192;                                              |
| IeeeMode            | = | 0;                                                |
| FlatPtr32           | = | 0;                                                |
| ScratchSize         | = | 0 dwords/thread;                                  |
| LDSByteSize         | = | <pre>0 bytes/workgroup (compile time only);</pre> |
|                     |   |                                                   |

AMD

### OCCUPANCY LOCAL MEMORY/WORK GROUP SIZE





- If you have a work group of size 64 and require 32KB of LDS, you won't be able to run more than 2 waves per CU => very low occupancy as two SIMDs won't be used
- ▲ If you have a work group of size 256 and use 32 KB of LDS, you can have up to two waves per SIMD
  - You actually need 8KB per wavefront
  - Thus you can have up to 8 wavefronts running per CU => 2 per SIMD
- ▲ LDS is a very fast low latency programmable cache, but it is a limited amount of resource. Use it sparingly

### OCCUPANCY

### 

#### Use codeXL to see where you lose in occupancy.

The conjunction of LDS/work group size and registers usage will impact the occupancy

| Kernel Occupancy                                          |      |    |
|-----------------------------------------------------------|------|----|
| Number of waves limited by Vector GPR and Work-group size | 40   | 40 |
| Number of waves limited by Scalar GPR and Work-group size | 40   | 40 |
| Number of waves limited by LDS and Work-group size        | 40   | 40 |
| Number of waves limited by Work-group size                | 40   | 40 |
| Limiting factor(s)                                        | None |    |
| Estimated occupancy                                       | 100% |    |

- On very slow kernel the problem can be obvious, request of 32KB of LDS, using more than 128GPRs....
- Though if you are not memory bound having a low occupancy is not necessarily bad. But very few kernels don't depend on the bandwidth

### BRANCHING

In a wavefront all the lanes will execute the same instructions

In case of branching if one thread diverge in a wavefront we will need to go through both path and mask the result of the unwanted path for the others threads

if(get\_local\_id(0)%2==0)

```
{} //executes in T1
```

```
else
```

- {} //executes in T2
- ▲ Here the overall time is T1+T2
- ▲ When branching can be avoided, you will achieve better performance
- ▲ In the example above consider rearranging the data or having two different kernels