DistDir
Advanced Topics

Configuration parameters

When the library is initialized by calling distdir_initialize, a t_config structure is created internally. It contains the settings of the library, which can be specified as environment variables or through API functions. At the moment, the following settings are supported:

  • exchanger type: it can be specified using the environment variable DISTDIR_EXCHANGER or the API function set_config_exchanger
  • verbose mode: it can be specified using the environment variable DISTDIR_VERBOSE or the API function set_config_verbose

The exchanger type specifies how the exchange is carried out. The following enumeration is defined internally:

  • IsendIrecv1=0 : each send buffer is filled and immediately sent with a call to MPI_Isend, while all receive buffers are posted with calls to MPI_Irecv; then MPI_Waitall is called and the received buffers are unpacked into the field data array
  • IsendIrecv2=1 : all send buffers are filled first and then sent with calls to MPI_Isend, while all receive buffers are posted with calls to MPI_Irecv; then MPI_Waitall is called and the received buffers are unpacked into the field data array
  • IsendRecv1=2 : each send buffer is filled and immediately sent with a call to MPI_Isend, then MPI_Waitall is called. Afterwards, all receive buffers are received with calls to MPI_Recv and, as a final step, all received buffers are unpacked into the field data array
  • IsendRecv2=3 : each send buffer is filled and immediately sent with a call to MPI_Isend, then MPI_Waitall is called. Afterwards, each receive buffer is received with a call to MPI_Recv and immediately unpacked into the field data array.
  • IsendIrecv1NoWait=4 : same as IsendIrecv1, but the send requests are waited on during the next call to the exchanger or during the call to delete_exchanger. It can only be used when the sending and receiving processes do not overlap, i.e. in concurrent implementations.
  • IsendIrecv2NoWait=5 : same as IsendIrecv2, but the send requests are waited on during the next call to the exchanger or during the call to delete_exchanger. It can only be used when the sending and receiving processes do not overlap, i.e. in concurrent implementations.
  • IsendRecv1NoWait=6 : same as IsendRecv1, but the send requests are waited on during the next call to the exchanger or during the call to delete_exchanger. It can only be used when the sending and receiving processes do not overlap, i.e. in concurrent implementations.
  • IsendRecv2NoWait=7 : same as IsendRecv2, but the send requests are waited on during the next call to the exchanger or during the call to delete_exchanger. It can only be used when the sending and receiving processes do not overlap, i.e. in concurrent implementations.

The default exchanger is IsendIrecv1. The environment variable sets this parameter globally, while the API function sets it per t_exchanger object. This means that, given the same map, fields with the same communication path can be exchanged with different exchangers using different types of exchange. In that case, the API function must be called before the call to new_exchanger, which creates the t_exchanger object, as sketched below.
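
A minimal sketch of this ordering, assuming distdir_initialize has already been called; the argument lists of set_config_exchanger and new_exchanger are assumptions for illustration, not the documented prototypes:

/* Sketch only: argument lists are assumed, not the actual API. */
set_config_exchanger(IsendIrecv2);     /* must precede new_exchanger */
t_exchanger *exchanger = new_exchanger(map, MPI_DOUBLE, CPU);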

The verbose mode specifies whether the library should print diagnostic output. The following enumeration is defined internally:

  • verbose_true=0
  • verbose_false=1

This parameter is self-explanatory.

Mapping 3D fields

The mapping of 3D fields is computationally more expensive, and the choice of the RD decomposition is crucial to achieve good performance when computing the exchange pattern (t_map object). The library's approach to this problem is to optimize the mapping algorithm for 2D fields and then adapt it to the third dimension. This is straightforward if an application uses only a 2D domain decomposition, as many climate and NWP applications do.

In this case, the optimal way to use the library is to create a t_map object for the exchange of 3D fields in two stages:

  • create a t_map object by calling the new_map API function, providing the global indices of the 2D decomposition for the first vertical level
  • use that t_map object to create a second t_map object which extrapolates the exchange pattern to the third dimension, where there is no domain decomposition. This is achieved with the extend_map_3d API function, which requires the t_map object based on the 2D decomposition of a single vertical level and the number of levels over which to extrapolate it. A new t_map object is created and can be used to create the t_exchanger objects, so the initially created t_map object can be deleted immediately (see the sketch after this list).
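
A minimal sketch of the two-stage creation; the new_map argument list (including the -1 stride placeholder) and the delete_map destructor name are assumptions for illustration:

/* Stage 1: exchange pattern for the 2D decomposition of a single
 * vertical level (-1: no stride; signature assumed). */
t_map *map_2d = new_map(idxlist_src, idxlist_dst, -1, MPI_COMM_WORLD);

/* Stage 2: extrapolate the pattern over nlevels vertical levels,
 * which carry no domain decomposition. */
t_map *map_3d = extend_map_3d(map_2d, nlevels);

/* The 2D map is no longer needed once the 3D map exists;
 * delete_map is a hypothetical destructor name. */
delete_map(map_2d);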

The creation of maps with extrapolation is very convenient and should provide optimal performance. However, it is not applicable to a large class of applications: in particular, it cannot be used when the domain decomposition is three-dimensional, as in CFD applications.

In this case, the new_map function must be used directly to create the t_map object for 3D fields. However, the user can provide an additional argument which lets the library create bucket strides based on the 2D global domain. The stride parameter should be set to the number of points in the global 2D domain. Internally, the library creates buckets based on the stride parameter which are identical to those that would be generated for a 2D slice of the same field. The buckets are then strided, so that there is one stride of each bucket on each vertical level of the 3D field. The forward mapping to the RD decomposition and the backward mapping to the original decomposition then proceed as usual. 3D mapping done this way is still significantly more expensive than 2D mapping, but it is expected to provide good performance for a wide range of applications.
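
For a global 2D domain of nx_global * ny_global points, the stride could be passed as follows (same assumed new_map signature as in the sketch above):

/* Number of points in one horizontal slice of the global domain. */
int stride = nx_global * ny_global;

/* The stride tells the library to build buckets on the 2D slice and
 * replicate them across the vertical levels (signature assumed). */
t_map *map_3d = new_map(idxlist_src, idxlist_dst, stride, MPI_COMM_WORLD);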

Exchanger methods

In the Configuration parameters section, the different exchanger types supported by the library have already been introduced and explained.

The purpose of this section is to provide some guidance on choosing an exchanger type for different applications. As explained in the Rationale page, the main goal of the library is to remove the complexity of the exchange when concurrent execution is implemented in a scientific application. The supported exchanger types mirror this basic idea, which originally drove the library's development.

Concurrent execution is crucial for many scientific applications targeting exascale systems because it allows each component to scale optimally and to run on the most suitable hardware. Concurrent execution always involves exchange points between components, and the exchange can be in one direction or in both directions.

In the case of a two-way exchange, both components send and receive information at an exchange point. This means that only IsendIrecv1, IsendIrecv2, IsendRecv1 or IsendRecv2 can be used. The optimal exchanger is highly dependent on the application itself and on the MPI implementation; the developers recommend testing the different exchanger types for each specific use case.

In the case of a one-way exchange, IsendIrecv1, IsendIrecv2, IsendRecv1 or IsendRecv2 can be used, but they would be suboptimal because the sending processes do not need to wait for the messages to be received. For this reason, IsendIrecv1NoWait, IsendIrecv2NoWait, IsendRecv1NoWait or IsendRecv2NoWait are recommended. With these, the sending processes fill the buffers and start a non-blocking send, then continue with the next step and wait for the send requests only right before the next exchange. In this way, overlap of communication and computation can be fully achieved on the sending processes. This matters in real applications: a classic example is a climate model which uses IO servers to write output files during the time loop. Here, the communication between the client processes and the server processes is one way. Using the nowait exchangers, the client processes fill the buffers, call MPI_Isend without waiting for the messages to be received, and continue with the heavy computations. During the following writing step, they wait for the messages of the previous writing step and then fill the buffers and send the new data.

The nowait exchangers can only be used when the sending and receiving processes belong to two non-overlapping groups. This means that they cannot be used when the data are sent to a subset of the processes, which is another well-known strategy to optimize output file writing, or when a transposition is done for spectral methods. The range of applications that can use the nowait exchangers is therefore limited, but those cases are common and are expected to be fully optimized with these exchangers. As for the other exchangers, the choice between the nowait variants depends on the application and on the MPI implementation, thus users should test the different nowait exchanger types for their specific use cases. A sketch of the resulting client-side pattern is given below.
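
This is a hypothetical sketch of the client loop from the IO-server example; compute_timestep and all argument lists are assumptions for illustration, and the wait for the previous step's sends happens inside the nowait exchanger itself:

/* Hypothetical client loop of a model writing through IO servers. */
set_config_exchanger(IsendIrecv1NoWait);   /* before new_exchanger */
t_exchanger *exchanger = new_exchanger(map, MPI_DOUBLE, CPU);

for (int step = 0; step < nsteps; step++) {
    compute_timestep(field);               /* heavy computation */
    /* Waits for the sends of the previous output step (if any), then
     * packs the buffers and starts the new non-blocking sends. */
    exchanger_go(exchanger, field, field_recv);
}

delete_exchanger(exchanger);   /* waits for any outstanding sends */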

Memory layout transformation

Climate applications usually apply a runtime transformation to the memory layout for caching purposes. This means that a 3D subdomain with a memory layout [npoints_local, nlevels] can be transformed at runtime to [nproma, nlevels, nblocks], where nproma is a runtime parameter. The library also supports this memory layout transformation. In this case, the index list has to be generated with the global indices in the original layout, which means that the generated map is also based on the original layout.

When the data need to be exchanged, the API function exchanger_go_with_transform can be used to transform the original memory layout into the one defined at runtime. In this case, the transformations for the source and the destination need to be provided as indices in an integer array.

An example:

/* transform[i] holds, for each position i in the blocked
 * [nproma, nlevels, nblocks] layout, the corresponding position in
 * the original [npoints_local, nlevels] layout. */
for (int block = 0; block < nblocks; block++)
    for (int level = 0; level < nlevels; level++)
        for (int k = 0; k < nproma; k++)
            transform[k + level * nproma + block * nlevels * nproma] =
                k + level * npoints_local + block * nproma;

During the packing of the buffer, the transformation is applied before filling the buffer; during the unpacking, the transformation is applied before filling the data field.
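
The transform arrays are then passed alongside the data pointers; the argument order of exchanger_go_with_transform below is an assumption for illustration:

/* src_data and dst_data are in the blocked layout; the transform
 * arrays map between the blocked and the original layout
 * (argument order assumed). */
exchanger_go_with_transform(exchanger, src_data, dst_data,
                            transform_src, transform_dst);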

GPU backend

The new_exchanger function requires a hardware type argument.

The hardware type specifies whether the MPI ranks are running on CPU or GPU. The following enumeration is defined internally:

  • CPU=0
  • GPU_NVIDIA=1
  • GPU_AMD=2

The API allows setting the hardware type per MPI rank and per exchanger object. This means that a group of MPI processes can run on the GPU and send data to a group of MPI processes running on the CPU; it also means that a group of MPI processes can send some data from the GPU with one exchanger and other data from the CPU with another exchanger, using the same previously generated map.

If the exchanger runs on the GPU, the data pointer passed to the exchanger_go or exchanger_go_with_transform functions must be a GPU pointer. In the case of a memory layout transformation, the transform arrays need to reside on the GPU as well.
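
For instance, a rank whose field lives in NVIDIA GPU memory might set up its exchanger as sketched below; the CUDA allocation is shown only for concreteness, and the new_exchanger and exchanger_go argument lists are assumptions:

/* Allocate the local send and receive fields directly on the device. */
double *field_send_d, *field_recv_d;
cudaMalloc((void **)&field_send_d, nsend * sizeof(double));
cudaMalloc((void **)&field_recv_d, nrecv * sizeof(double));

/* The hardware type marks this rank's data as resident on an
 * NVIDIA GPU (argument list assumed). */
t_exchanger *exchanger = new_exchanger(map, MPI_DOUBLE, GPU_NVIDIA);

/* Device pointers are passed directly to the exchange. */
exchanger_go(exchanger, field_send_d, field_recv_d);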