
Offloading code with compiler directives


In our last blog we discussed how to convert CUDA applications to HIP. In this blog we will scratch the surface of how to offload your code to a device using directives.

One of the less intrusive ways to parallelize and offload code is to use a directive-based approach. This approach consists of inserting special directives into existing code. These directives are instructions to the compiler on how to parallelize and offload the code. Two standards have wide support, OpenMP and OpenACC, and both work on the same host/device model. However, some differences exist between them, some minor and some major, that can make transitioning between them more difficult.

The goal of this blog is to provide some insight into how directive-based offloading works so that you can decide if it is the right choice for your code. For a more in-depth presentation, we recommend the GPU programming with OpenMP/OpenACC training carried out by partners within the LUMI consortium, or other similar training.

OpenMP

OpenMP has emerged as the de facto standard for adding shared memory parallelism to HPC codes, i.e. parallelizing code within a node. An OpenMP directive starts with #pragma omp (C/C++) or !$omp (Fortran).

The most basic construct in OpenMP is the parallel construct, which instructs the compiler to generate parallel code for the following statement or basic block, or, in Fortran, the code between the directive and the end directive. For a more in-depth introduction to parallel programming with OpenMP for CPUs check out the OpenMP in Small Bites tutorial.

As an example, let's consider a simple print within a `parallel` region. The code presented below will execute the print statement in parallel, meaning that it will end up printing the text multiple times; how many times depends on how many threads OpenMP is instructed to spawn.

C/C++
#pragma omp parallel
{
  printf("Hello, world.\n");
}
Fortran
!$omp parallel
print *, 'Hello, world.'
!$omp end parallel

 

While this gets us parallel execution, it's hard to do effective parallel work with just the parallel directive. This is where the worksharing-loop constructs come into play. The primary work sharing directive is the for directive, which allows us to take a loop and distribute its iterations over multiple threads, each working on a chunk of the loop in parallel. While OpenMP allows you to easily parallelize code, you are still responsible for the code running correctly in parallel, meaning it is up to you to sort out any race conditions, etc.

For a more practical example of how work sharing works with OpenMP we use the triad stream operation: A[i]=B[i]+scalar*C[i]. Notice that we use the parallel for directive to distribute the iterations of the two loops present in the code over multiple threads.

C/C++
#include <stdlib.h>

int main() {
  size_t n = 1000000000;
  double *A = (double*) malloc(n*sizeof(double));
  double *B = (double*) malloc(n*sizeof(double));
  double *C = (double*) malloc(n*sizeof(double));
  double scalar = 2;

  #pragma omp parallel for
  for (size_t i = 0; i < n; i++) {
     B[i] = 1;
     C[i] = i;
  }

  #pragma omp parallel for
  for(size_t i = 0; i < n; i++) {
    A[i] = B[i] + scalar * C[i];
  }

  free(A);
  free(B);
  free(C);

  return 0;
}
Fortran
program main
  implicit none
  integer, parameter :: n = 1000000000
  integer :: i

  double precision :: A(n), B(n), C(n)
  double precision :: scalar = 2

  !$omp parallel do
  do i = 1, n
     B(i) = 1
     C(i) = i
  end do
  !$omp end parallel do

  !$omp parallel do
  do i = 1, n
    A(i) = B(i) + scalar * C(i)
  end do
  !$omp end parallel do

end program

 

With this very quick introduction to parallelizing code on CPUs with OpenMP, let's move on to how we can do the same for GPUs, or possibly other parallel accelerator devices.

Device offloading with OpenMP

A more recent addition to OpenMP is a set of directives that can be used to offload code execution to accelerators. This was first introduced in OpenMP 4.0 and later refined in OpenMP 4.5 and, more recently, OpenMP 5.0. Full compiler support for OpenMP 5.0 is still somewhat uncertain, and many compilers currently only support part of the OpenMP 5.0 specification. However, as time goes on, we expect the support to improve.

Execution Model

The OpenMP programming model for GPUs is the offload model, where we have a thread running on the host CPU and from there we offload part of the computation to an accelerator. To instruct the compiler to offload our code to an accelerator, we use the target directive.

C/C++
#pragma omp target
for (size_t i = 0; i < n; i++) {
  A[i] = B[i] + scalar * C[i];
}
Fortran
!$omp target
do i = 1, n
   A(i) = B(i) + scalar * C(i)
end do
!$omp end target

 

The standalone target directive is, however, not very useful. In the previous example, our triad code will run on the GPU but will only use one thread. As a consequence, we will not benefit from the massively parallel nature of the accelerator, nor are we telling the compiler what data needs to be moved between the host (CPU) and the device (GPU). In order to leverage all the available levels of parallelism on the device, we can use the combined target teams distribute parallel for construct, which can be decomposed into three parts:

  • target: starts the offload
  • teams distribute: creates multiple teams of threads and distributes the iterations between them
  • parallel for: distributes the iterations between the threads within a team

Using the combined construct, our triad kernel can be written as

C/C++
#pragma omp target teams distribute parallel for
for (size_t i = 0; i < n; i++) {
  A[i] = B[i] + scalar * C[i];
}
Fortran
!$omp target teams distribute parallel do
do i = 1, n
  A(i) = B(i) + scalar * C(i)
end do
!$omp end target teams distribute parallel do

 

 

Adding a work-sharing directive to the target directive allows us to do work in parallel across the accelerator. This has evolved somewhat between OpenMP 4.5 and 5.0. With a compiler that has OpenMP 5.0 offload support, one should be able to distribute a loop's iterations in a target region by using the teams loop directive, and the compiler will pick a decent way of parallelizing the loop, as shown in the example below.

C/C++
#pragma omp target teams loop
for (size_t i = 0; i < n; i++) {
  A[i] = B[i] + scalar * C[i];
}
Fortran
!$omp target teams loop
do i = 1, n
  A(i) = B(i) + scalar * C(i)
end do
!$omp end target teams loop

 

 

In practice, the OpenMP 5.0 teams loop construct will produce the same result as the previous example with the longer teams distribute parallel for. The motivation behind the introduction of the loop construct in OpenMP 5.0 is that the programmer may not have in-depth knowledge of the target device. Instead of expecting the programmer to become an expert, the loop construct allows exposing the parallelism and lets the compiler do the mapping onto the target architecture.

The offloaded loops can also be combined with most of the usual OpenMP loop features, such as reduction and scan operations, atomic operations, etc., and they provide the basic building blocks for offloading many kinds of parallel algorithms.
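For instance, assuming the same n, B and C as in the triad example, a dot product could use a reduction clause on the combined construct. This is only a minimal sketch; the explicit map of sum makes sure the reduced value is copied back to the host:

C/C++
double sum = 0.0;
// The reduction clause combines the per-thread and per-team partial
// sums into a single value on the device.
#pragma omp target teams distribute parallel for reduction(+:sum) \
        map(to: B[0:n], C[0:n]) map(tofrom: sum)
for (size_t i = 0; i < n; i++) {
  sum += B[i] * C[i];
}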

Data Movement

One of the key things to understand about offloading computation is that all GPUs used in HPC systems are separate devices with their own memory space, independent of the CPU. As such, any data from the host CPU that we plan to use on the GPU needs to be transferred to the GPU before any computation on it can take place. Since the connection between the CPU and the GPU is far slower than the memory bandwidth, a key aspect of the performance of offloaded code is how, and more importantly how often, data is passed back and forth between the GPU and the host system. Constantly moving data back and forth will likely negate any benefit from offloading the computation to the GPU. Hence, it is important that once data is moved to the GPU it stays there for as long as possible and that only the parts that are really needed are moved back to the host.

With OpenMP, data movement between the host and the GPU is handled with data directives. These come in two flavors: region-based and standalone directives. The region-based directives work the same way as other OpenMP region directives: they instruct the compiler on what to do with data when entering the region and what to do with it at the exit of the region. A data region is created using the target data construct and is further described using map clauses, which describe how a variable or an array section is moved to and from the device data environment.

For example, in our triad code we can use map(to: B[0:n], C[0:n]) to copy the B and C arrays to the device when the code enters the region, and map(from: A[0:n]) to copy A back to the host once the device execution is done. Note that the data movement is described from the point of view of the host, i.e. the CPU.

C/C++
#pragma omp target data map(to: B[0:n], C[0:n]) map(from: A[0:n])
{
  #pragma omp target teams distribute parallel for
  for (size_t i = 0; i < n; i++) {
    A[i] = B[i] + scalar * C[i];
  }
}
Fortran
!$omp target data map(to: B(1:n), C(1:n)) map(from: A(1:n))
!$omp target teams distribute parallel do
do i = 1, n
  A(i) = B(i) + scalar * C(i)
end do
!$omp end target teams distribute parallel do
!$omp end target data

 

Since the region-based directives are built around structured blocks, some data movement cannot be described with them, for instance regions starting and ending in different functions or source files. For that, there are the standalone directives, which can be used to construct more complex data regions. These take the form of two directives: one for entering a data region and one for exiting it. The standalone directives use the same map clauses as the region-based ones. We use target enter data to start the region and target exit data to end it, and since these can be in different files they are ideal for longer and more complex data regions.

With the standalone data constructs the stream triad code would now look as follows:

C/C++
#pragma omp target enter data map(to: B[0:n], C[0:n]) map(alloc: A[0:n])
#pragma omp target teams distribute parallel for
for (size_t i = 0; i < n; i++) {
  A[i] = B[i] + scalar * C[i];
}
#pragma omp target exit data map(from: A[0:n]) map(delete: B[0:n], C[0:n])
Fortran
!$omp target enter data map(to: B(1:n), C(1:n)) map(alloc: A(1:n))
!$omp target teams distribute parallel do
do i = 1, n
  A(i) = B(i) + scalar * C(i)
end do
!$omp end target teams distribute parallel do
!$omp target exit data map(from: A(1:n)) map(delete: B(1:n), C(1:n))

 

In the example above, we introduce two new map types. The first one, alloc, used with the enter data directive, allocates the storage for A on the device without copying any data. The second one, delete, used with the exit data directive, deallocates B and C on the device.
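Because target enter data and target exit data are standalone directives, the allocation and the cleanup do not have to sit next to the kernel. The sketch below shows how this could look; the function names are hypothetical and only illustrate that the directives can live in different functions or even different source files:

C/C++
// Hypothetical setup routine: copy the inputs and allocate the output
// in the device data environment.
void device_alloc(double *A, double *B, double *C, size_t n) {
  #pragma omp target enter data map(to: B[0:n], C[0:n]) map(alloc: A[0:n])
}

// The kernel can live in another function or file: the arrays are
// found in the device data environment through their host addresses.
void triad(double *A, double *B, double *C, double scalar, size_t n) {
  #pragma omp target teams distribute parallel for
  for (size_t i = 0; i < n; i++) {
    A[i] = B[i] + scalar * C[i];
  }
}

// Hypothetical cleanup routine: copy the result back and release the
// device copies.
void device_free(double *A, double *B, double *C, size_t n) {
  #pragma omp target exit data map(from: A[0:n]) map(delete: B[0:n], C[0:n])
}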

Some architectures support unified shared memory (USM), which allows the CPUs and the GPUs in the system to use a unified address space. This means that data can be migrated between CPU and GPU memory transparently to the application, so you no longer need to copy data explicitly. To check whether the compiler supports USM, use the requires directive.

C/C++
#pragma omp requires unified_shared_memory
Fortran
!$omp requires unified_shared_memory

 

Not all compilers support USM, so it is still recommended to use explicit mapping between the host and the device for portability. When using USM, these explicit mappings give the compiler more information about data usage and increase the potential for optimizations.
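As a minimal sketch, assuming a compiler and GPU runtime that actually support unified shared memory, the triad example can then run on plain host allocations without any map clauses:

C/C++
#include <stdlib.h>

// Requires a compiler and runtime with unified shared memory support.
#pragma omp requires unified_shared_memory

int main() {
  size_t n = 1000000000;
  double *A = (double*) malloc(n*sizeof(double));
  double *B = (double*) malloc(n*sizeof(double));
  double *C = (double*) malloc(n*sizeof(double));
  double scalar = 2;

  for (size_t i = 0; i < n; i++) {
    B[i] = 1;
    C[i] = i;
  }

  // No map clauses: with USM the host allocations are directly
  // accessible from the device and migrated by the runtime.
  #pragma omp target teams distribute parallel for
  for (size_t i = 0; i < n; i++) {
    A[i] = B[i] + scalar * C[i];
  }

  free(A);
  free(B);
  free(C);

  return 0;
}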

Current compiler support

Support for offloading varies between compilers and accelerator architectures. A general overview can be found here.

On LUMI, the AMD GPUs used in the system will be supported by at least the Cray compiler as well as AMD's OpenMP implementation for Clang (initially, at least, with C/C++ support only). Both of these compilers are based on LLVM/Clang, which supports offloading to NVIDIA and AMD GPUs. More information on the status of the implementation is available here.

If you don't have access to AMD GPUs but only NVIDIA ones, your best bet is the NVIDIA HPC compiler. This compiler will probably give you the best performance on NVIDIA hardware. However, it only supports a subset of OpenMP 5.0, so you may encounter directives that are not supported. Additional information is available here.

GCC has supported offloading for both NVIDIA and AMD GPUs since version 10, but the performance may not be at the same level as the other compilers mentioned above. More information is available here, and here for the implementation status.

The table below summarizes useful compiler flags to compile your OpenMP code with offloading.

                  NVC/NVFortran    Clang/Cray/AMD                   GCC/GFortran
OpenMP flag       -mp              -fopenmp                         -fopenmp -foffload=<target>
Offload flag      -mp=gpu          -fopenmp-targets=<target>        -foffload=<target>
Target NVIDIA     default          nvptx64-nvidia-cuda              nvptx-none
Target AMD        n/a              amdgcn-amd-amdhsa                amdgcn-amdhsa
GPU Architecture  -gpu=<cc>        -Xopenmp-target -march=<arch>    -foffload="-march=<arch>"

 

Building a compiler with offload support can be somewhat tricky. The LUMI early access platform will have a version of the Cray compiler with preliminary support for AMD GPUs, and in addition to this we are looking at what other compilers we can support on it, with the primary targets being Clang and/or the AMD compiler. If, however, you want to try and build a compiler for your own systems, you can have a look at the script here that does it for NVIDIA GPUs and GCC.

OpenACC

On the surface, OpenACC offloading works the same way as OpenMP offloading: it has the same basic programming model, with the CPU driving the execution and offloading parts of the computation to the GPU. For the code to be offloaded, you need to insert directives instructing the compiler which parts you want offloaded and what data should be moved to and from the device. As such, the basic programming ideas are no different from OpenMP, with the differences being mostly in the directives used and some minor functionality differences. However, OpenACC was designed from the ground up to target accelerators, while OpenMP is more general-purpose.

Offloading with OpenACC

Taking the same loop we have been working on with OpenMP and translating it to OpenACC gives the following.

C/C++
#pragma acc data copyin(B[0:n], C[0:n]) copyout(A[0:n])
{
  #pragma acc parallel loop
  for (size_t i = 0; i < n; i++) {
    A[i] = B[i] + scalar * C[i];
  }
}
Fortran
!$acc data copyin(B(1:n), C(1:n)) copyout(A(1:n))
!$acc parallel loop
do i = 1, n
  A(i) = B(i) + scalar * C(i)
end do
!$acc end parallel loop
!$acc end data

 

As we can see, the directives are still in the same places and they do the same thing. For offloading the code, we have the parallel loop directive, which is a combined directive: the parallel directive starts parallel execution on the device and the loop worksharing directive distributes the iterations of the loop in parallel. In OpenACC you also have the option of using the kernels directive to offload code. This can be applied to a collection of loops and gives the compiler more freedom to choose how they are offloaded (a sketch is shown below). However, as there is no equivalent of this in OpenMP, one should not use it if the idea is to write code that can be switched between the two easily.
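The sketch below applies kernels to the two triad loops. The data clauses are not strictly required but help the compiler, which cannot deduce array sizes from raw C pointers:

C/C++
// The kernels construct lets the compiler decide how to parallelize
// each enclosed loop; loop independent asserts that the iterations
// can safely run in parallel.
#pragma acc kernels create(B[0:n], C[0:n]) copyout(A[0:n])
{
  #pragma acc loop independent
  for (size_t i = 0; i < n; i++) {
    B[i] = 1;
    C[i] = i;
  }
  #pragma acc loop independent
  for (size_t i = 0; i < n; i++) {
    A[i] = B[i] + scalar * C[i];
  }
}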

In the triad example, the OpenMP structured target data directive is replaced by its OpenACC equivalent: the data directive. Similarly to OpenMP, OpenACC also has standalone directives: enter data and exit data.
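As a sketch, the unstructured version of the triad data management, mirroring the OpenMP target enter data / target exit data example above, could look like this:

C/C++
// Copy the inputs to the device and allocate the output there.
#pragma acc enter data copyin(B[0:n], C[0:n]) create(A[0:n])

#pragma acc parallel loop
for (size_t i = 0; i < n; i++) {
  A[i] = B[i] + scalar * C[i];
}

// Copy the result back and release the device copies.
#pragma acc exit data copyout(A[0:n]) delete(B[0:n], C[0:n])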

Current compiler support

This page lists compilers that support OpenACC. While there are fewer compilers that support OpenACC, some of them offer quite mature support. For instance, NVIDIA's HPC SDK offers great support. For more information, see here.

Cray's compiler also offers mature support, although not yet for the latest versions of the standard. Support for the latest version of the standard is expected in the months following the commissioning of the LUMI GPU partition.

Support for OpenACC 2.6 is also available in GCC but the implementation is not as mature as the compilers mentioned above. For more information, see here.

 

Authors:

Fredrik Robertsén, Technology strategist, LUMI Leadership Computing Facility

Orian Louant, Applications Specialist, LUMI User Support Team & EuroCC NCC Belgium

Georgios Markomanolis, Lead HPC scientist, CSC

Image: Adobe Stock