First of all, the passive RMA mechanism does not somehow magically poke into the remote process's memory: not many MPI transports have real RDMA capabilities, and even those that do (e.g. InfiniBand) require a great deal of not-that-passive involvement of the target in order for passive RMA operations to happen. This is explained in the MPI standard, albeit in the very abstract form of public and private copies of the memory exposed through an RMA window.
Achieving working and portable passive RMA with MPI-2 involves several steps.
Step 1: Window allocation in the target process
For portability and performance reasons, the memory for the window should be allocated using MPI_ALLOC_MEM:
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);

int *schedule;
MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);
for (int i = 0; i < size; i++)
{
    schedule[i] = 0;
}

MPI_Win win;
MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
    MPI_COMM_WORLD, &win);
...
MPI_Win_free(&win);
MPI_Free_mem(schedule);
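Note that MPI_Alloc_mem is allowed to fail if the implementation cannot provide specially registered memory. A minimal sketch of a fallback to the regular allocator (my addition, not part of the original program; the default error handler is fatal, so it has to be replaced first for the return code check to work):

MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

int *schedule;
if (MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule) != MPI_SUCCESS)
{
    // Fall back to malloc() - the window will still work, possibly
    // with lower RMA performance on some transports
    schedule = malloc(size * sizeof(int));
}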
Step 2: Memory synchronisation at the target
The MPI standard forbids concurrent access to the same location in the window (§11.3 from the MPI-2.2 specification):
It is erroneous to have concurrent conflicting accesses to the same memory location in a window; if a location is updated by a put or accumulate operation, then this location cannot be accessed by a load or another RMA operation until the updating operation has completed at the target.
Therefore each access to schedule[] at the target has to be protected by a lock (a shared one, since the loop only reads the memory location):
while (!ready)
{
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    ready = fnz(schedule, oldschedule, size);
    MPI_Win_unlock(0, win);
}
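Keep in mind that this is a busy-wait loop. As a small tweak (my addition, not in the original code), the polling rate can be throttled so the master does not hog a CPU core while it waits:

while (!ready)
{
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    ready = fnz(schedule, oldschedule, size);
    MPI_Win_unlock(0, win);
    usleep(10000);   // poll every ~10 ms; usleep() comes from <unistd.h>
}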
Another reason for locking the window at the target is that each lock/unlock provides an entry point into the MPI library and thus facilitates progression of the local part of the RMA operation. MPI provides portable RMA even over non-RDMA-capable transports, e.g. TCP/IP or shared memory, and that requires a lot of active work (called progression) to be done at the target in order to support "passive" RMA. Some libraries provide asynchronous progression threads that can advance the operation in the background, e.g. Open MPI when configured with --enable-opal-multi-threads (disabled by default), but relying on such behaviour results in non-portable programs. That's why the MPI standard allows for the following relaxed semantics of the put operation (§11.7, p. 365):
6. An update by a put or accumulate call to a public window copy becomes visible in the private copy in process memory at latest when an ensuing call to MPI_WIN_WAIT, MPI_WIN_FENCE, or MPI_WIN_LOCK is executed on that window by the window owner.

If a put or accumulate access was synchronized with a lock, then the update of the public window copy is complete as soon as the updating process executed MPI_WIN_UNLOCK. On the other hand, the update of the private copy in the process memory may be delayed until the target process executes a synchronization call on that window (6). Thus, updates to process memory can always be delayed until the process executes a suitable synchronization call. Updates to a public window copy can also be delayed until the window owner executes a synchronization call, if fences or post-start-complete-wait synchronization is used. Only when lock synchronization is used does it become necessary to update the public window copy, even if the window owner does not execute any related synchronization call.
This is also illustrated in Example 11.12 in the same section of the standard (p. 367). And indeed, both Open MPI and Intel MPI do not update the value of schedule[] if the lock/unlock calls in the code of the master are commented out. The MPI standard further advises (§11.7, p. 366):
Advice to users. A user can write correct programs by following the following rules:
...
lock: Updates to the window are protected by exclusive locks if they may conflict. Nonconflicting accesses (such as read-only accesses or accumulate accesses) are protected by shared locks, both for local accesses and for RMA accesses.
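In terms of the program at hand, that rule maps directly onto the two kinds of access epochs used (both appear in the full sample further down):

// Master: only reads the window -> a shared lock suffices
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
ready = fnz(schedule, oldschedule, size);
MPI_Win_unlock(0, win);

// Worker: updates the window -> take an exclusive lock
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
MPI_Win_unlock(0, win);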
Step 3: Providing the correct parameters to MPI_PUT at the origin
A call like

MPI_Put(&schedule[myrank], 1, MPI_INT, 0, 0, 1, MPI_INT, win);

would transfer everything into the first element of the target window, since the target displacement (the fifth argument) is 0. The correct invocation, given that the window at the target was created with disp_unit == sizeof(int), is:
int one = 1;
MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
The local value of one is thus written rank * sizeof(int) bytes past the beginning of the window at the target. If disp_unit were set to 1, the correct put would be:
MPI_Put(&one, 1, MPI_INT, 0, rank * sizeof(int), 1, MPI_INT, win);
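In both cases the data ends up at the same place, since the standard computes the target address as:

// Target address of an RMA access, as defined by the MPI standard:
//   target_addr = window_base + target_disp * disp_unit
// Here that is schedule + rank * sizeof(int), i.e. &schedule[rank]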
Step 4: Dealing with implementation specifics
The program as detailed above works out of the box with Intel MPI. With Open MPI one has to take special care. The library is built around a set of frameworks and implementing modules. The osc (one-sided communication) framework comes in two implementations, rdma and pt2pt. The default (in Open MPI 1.6.x and probably earlier) is rdma, and for some reason it does not progress the RMA operations at the target side when MPI_WIN_(UN)LOCK is called, which leads to deadlock-like behaviour unless another communication call is made (MPI_BARRIER in your case). On the other hand, the pt2pt module progresses all operations as expected. Therefore with Open MPI one has to start the program as follows in order to specifically select the pt2pt component:
$ mpiexec --mca osc pt2pt ...
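One way to check which osc components a given Open MPI build ships with is to grep the output of ompi_info:

$ ompi_info | grep osc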
A fully working C99 sample code follows:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

// Compares schedule and oldschedule and prints schedule if different
// Also displays the time in seconds since the first invocation
int fnz (int *schedule, int *oldschedule, int size)
{
    static double starttime = -1.0;
    int diff = 0;

    for (int i = 0; i < size; i++)
        diff |= (schedule[i] != oldschedule[i]);

    if (diff)
    {
        int res = 0;

        if (starttime < 0.0) starttime = MPI_Wtime();
        printf("[%6.3f] Schedule:", MPI_Wtime() - starttime);
        for (int i = 0; i < size; i++)
        {
            printf(" %d", schedule[i]);
            res += schedule[i];
            oldschedule[i] = schedule[i];
        }
        printf("\n");

        // All workers have checked in once the schedule sums to size-1
        return (res == size-1);
    }
    return 0;
}
int main (int argc, char **argv)
{
    MPI_Win win;
    int rank, size;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        int *oldschedule = malloc(size * sizeof(int));
        // Use MPI to allocate memory for the target window
        int *schedule;
        MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);

        for (int i = 0; i < size; i++)
        {
            schedule[i] = 0;
            oldschedule[i] = -1;
        }

        // Create a window. Set the displacement unit to sizeof(int) to
        // simplify the addressing at the originator processes
        MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
            MPI_COMM_WORLD, &win);

        int ready = 0;
        while (!ready)
        {
            // Without the lock/unlock schedule stays forever filled with 0s
            MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
            ready = fnz(schedule, oldschedule, size);
            MPI_Win_unlock(0, win);
        }
        printf("All workers checked in using RMA\n");

        // Release the window
        MPI_Win_free(&win);
        // Free the allocated memory
        MPI_Free_mem(schedule);
        free(oldschedule);

        printf("Master done\n");
    }
    else
    {
        int one = 1;

        // Worker processes do not expose memory in the window
        MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        // Simulate some work based on the rank
        sleep(2*rank);

        // Register with the master
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
        MPI_Win_unlock(0, win);

        printf("Worker %d finished RMA\n", rank);

        // Release the window
        MPI_Win_free(&win);

        printf("Worker %d done\n", rank);
    }

    MPI_Finalize();
    return 0;
}
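The sample can be compiled with something like the following (assuming a GCC-based mpicc wrapper; -std=c99 is needed because of the declarations inside the for loops):

$ mpicc -std=c99 -o rma rma.c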
Sample output with 6 processes:
$ mpiexec --mca osc pt2pt -n 6 rma
[ 0.000] Schedule: 0 0 0 0 0 0
[ 1.995] Schedule: 0 1 0 0 0 0
Worker 1 finished RMA
[ 3.989] Schedule: 0 1 1 0 0 0
Worker 2 finished RMA
[ 5.988] Schedule: 0 1 1 1 0 0
Worker 3 finished RMA
[ 7.995] Schedule: 0 1 1 1 1 0
Worker 4 finished RMA
[ 9.988] Schedule: 0 1 1 1 1 1
All workers checked in using RMA
Worker 5 finished RMA
Worker 5 done
Worker 4 done
Worker 2 done
Worker 1 done
Worker 3 done
Master done