
MPI Design

SPMD: Single Program, Multiple Data

1. P2P Communication

1.1 Blocking

MPI_Send does not return until the send buffer is safe to reuse.

MPI_Recv does not return until the received message is available in the receive buffer.

Whether MPI_Send can complete before the matching receive is posted depends on whether the implementation copies the message into an internal system buffer.
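A minimal sketch of blocking point-to-point communication (message size, tag, and buffer names are arbitrary choices for illustration):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double msg[4] = {0};
    if (rank == 0) {
        msg[0] = 3.14;
        // returns once msg is safe to reuse
        MPI_Send(msg, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // returns once msg holds the received data
        MPI_Recv(msg, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", msg[0]);
    }

    MPI_Finalize();
    return 0;
}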

1.2 Non-Blocking

Completion is checked with MPI_Wait or MPI_Test:

MPI_Request request;
MPI_Status status;
int flag;

MPI_Isend(start, count, datatype, dest, tag, comm, &request);
MPI_Irecv(start, count, datatype, source, tag, comm, &request);

MPI_Wait(&request, &status);          // block until the operation completes
MPI_Test(&request, &flag, &status);   // non-blocking check; flag is set if complete
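A sketch of overlapping local computation with a pending receive by polling MPI_Test; the do_some_local_work() call and the buffer names are placeholders, not from the original notes:

MPI_Request request;
MPI_Status status;
int done = 0;

// Post the receive early so the transfer can progress in the background.
MPI_Irecv(buf, count, MPI_DOUBLE, source, tag, comm, &request);

while (!done) {
    do_some_local_work();                 // placeholder for useful computation
    MPI_Test(&request, &done, &status);   // check completion without blocking
}
// Here buf contains the received message.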

1.3 Deadlock

Unsafe pattern: both processes issue a blocking Send before posting their Recv, so each send may wait for the other side's receive and neither makes progress.

  • The system buffer has a limited size.
  • Before the matching receive is posted, the system buffer is used to store the data being sent.
  • When the data to send is larger than the system buffer, the send cannot complete until the receiver provides buffer space.

Solutions:

  • Re-order the calls so one side receives first. Potential drawback: the exchange becomes sequential and wastes bandwidth.

    // process 0
    Send(1);
    Recv(1);
    // Process 1
    Recv(0);
    Send(0);
    
  • Sendrecv

    // process 0
    SendRecv(1);
    // Process 1
    SendRecv(0);
    

    SendRecv allows simultaneous sending and receiving.

    It posts the receive buffer at the same time as the send, so no ordering deadlock can occur.

  • Bsend: the sender provides its own buffer for the outgoing message (attached with MPI_Buffer_attach), so the send can complete locally.

    // process 0
    Bsend(1);
    Recv(1);
    // Process 1
    Bsend(0);
    Recv(0);
    
  • Partially asynchronous approaches still risk serializing the exchange and wasting bandwidth.

  • (Recommended) Isend + Irecv + Waitall in both processes (see the sketch at the end of this section)

e.g. mesh neighbor exchange

-> Principle and lesson learned: delay synchronization operations.
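A minimal sketch of the recommended pattern for a symmetric exchange: both ranks run the same code, `other` is the peer's rank, and the buffer names and counts are illustrative:

MPI_Request reqs[2];

// Post the receive and the send without blocking, then wait for both;
// no ordering between the two ranks is required, so no deadlock is possible.
MPI_Irecv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1]);
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);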

1.4 P2P Protocols

Eager Protocol

  • The message is sent to the receiver immediately, with no confirmation required -> reduces synchronization delay
  • Requires an extra local data copy (into a temporary buffer)
  • Requires enough buffer space, so it is used for small messages

Rendezvous Protocol

  • The sender first sends an envelope and waits for the receiver's acknowledgement -> extra time to synchronize
  • No extra local data copy (data lands directly in the user receive buffer)
  • Used for large messages

2. Process Mapping

Map processes onto physical resources sensibly:

  • Different processes have different communication needs
  • Different nodes and links in a cluster have different performance (bandwidth, latency)

Problem Abstraction: Graph Mapping


  • The mapping problem is NP-complete
  • In practice it is solved with heuristic algorithms

Two simple mapping schemes (which one is better depends on the application's communication pattern; see the sketch below):

  • Block: consecutive ranks are placed on the same node
  • Cyclic: ranks are assigned round-robin across nodes
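A tiny sketch of where a rank lands under the two schemes; `nnodes` (number of nodes) and `ppn` (processes per node) are assumed parameters, not from the original notes:

// Block mapping: consecutive ranks fill one node before moving to the next.
int block_node  = rank / ppn;

// Cyclic mapping: ranks are dealt out round-robin across the nodes.
int cyclic_node = rank % nnodes;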

Process Binding

  • Pinning each process to a fixed core reduces the cost of context switches, cache misses, and cross-NUMA memory access

3. Collective Communication

3.1 Bcast and Reduce

Broadcast A from P1 to all of the processes. (1-to-all)


Flat Tree

  • (P − 1) sends from the root
  • Wastes bandwidth: the root's link is the bottleneck while the other links sit idle

Binomial Tree: ceil(log2 P) steps (see the sketch below)

Van de Geijn (for long messages): Scatter + Allgather
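A sketch of a binomial-tree broadcast built from point-to-point calls, in the classic pattern used for short messages; `buf`, `count`, `datatype`, `root`, and `comm` are assumed to be given, and this is an illustration rather than any library's actual code:

int rank, size;
MPI_Comm_rank(comm, &rank);
MPI_Comm_size(comm, &size);

int vrank = (rank - root + size) % size;   // rank relative to the root

// Receive the data from the parent in the binomial tree (the root skips this).
int mask = 1;
while (mask < size) {
    if (vrank & mask) {
        int src = (rank - mask + size) % size;
        MPI_Recv(buf, count, datatype, src, 0, comm, MPI_STATUS_IGNORE);
        break;
    }
    mask <<= 1;
}

// Forward the data to the children; the whole broadcast takes ceil(log2 P) steps.
mask >>= 1;
while (mask > 0) {
    if (vrank + mask < size) {
        int dst = (rank + mask) % size;
        MPI_Send(buf, count, datatype, dst, 0, comm);
    }
    mask >>= 1;
}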

Reduce data from all of the processes to P2 with sum op. (all-to-1)

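For reference, the corresponding library call; the buffer names are illustrative:

// Every rank contributes `local`; rank 2 receives the element-wise sum in `global`.
MPI_Reduce(local, global, n, MPI_DOUBLE, MPI_SUM, 2, MPI_COMM_WORLD);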

3.2 Scatter and Gather

Scatter: Distribute different data to different processes. (1-to-all)

  • The root node has all of the data at the beginning.
  • The root node sends different parts of the data to different processes.


Gather: The root node gathers data from all of the processes. (all-to-1)


3.3 All-X

3.3.1 Allgather

Allgather = Gather + Bcast


Ring

  • (P − 1) steps
  • No bandwidth waste: every link is used in every step (see the ring sketch at the end of this subsection)

Bruck

Recursive Doubling
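A sketch of the ring allgather; each rank owns one block of `blk` elements and the names are illustrative. In each of the P − 1 steps, a rank forwards to its right neighbor the block it received in the previous step and receives a new block from its left neighbor:

// Place the local block into its slot of the result buffer first.
memcpy(recvbuf + rank * blk, sendbuf, blk * sizeof(double));

int right = (rank + 1) % size;
int left  = (rank - 1 + size) % size;

for (int step = 0; step < size - 1; step++) {
    int send_block = (rank - step + size) % size;       // block forwarded this step
    int recv_block = (rank - step - 1 + size) % size;   // block arriving this step
    MPI_Sendrecv(recvbuf + send_block * blk, blk, MPI_DOUBLE, right, 0,
                 recvbuf + recv_block * blk, blk, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);
}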

3.3.2 Allreduce

Reduce + Bcast


Ring: Reduce-Scatter + Allgather
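A sketch of that composition expressed with library calls; it assumes `count` is divisible by the number of processes, and `tmp` is an illustrative scratch buffer of `count / size` elements:

int blk = count / size;   // each rank ends up owning the reduced result for one block
MPI_Reduce_scatter_block(sendbuf, tmp, blk, MPI_DOUBLE, MPI_SUM, comm);   // reduce-scatter phase
MPI_Allgather(tmp, blk, MPI_DOUBLE, recvbuf, blk, MPI_DOUBLE, comm);      // allgather phase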

3.3.3 Alltoall

Alltoall: each process gathers different data from all of the processes (equivalent to P Gathers). Common algorithms:

  • Irecv-Isend
  • Pairwise exchange (see the sketch below)
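A sketch of the pairwise-exchange algorithm; each rank holds one block of `blk` elements per destination, and the names are illustrative:

for (int step = 0; step < size; step++) {
    int dst = (rank + step) % size;          // partner to send to at this step
    int src = (rank - step + size) % size;   // partner to receive from at this step
    MPI_Sendrecv(sendbuf + dst * blk, blk, MPI_DOUBLE, dst, 0,
                 recvbuf + src * blk, blk, MPI_DOUBLE, src, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}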

4. Communicator and Group

MPI_COMM_WORLD: the pre-defined communicator containing all of the processes

  • Group: an ordered set of processes that can communicate with each other
  • Communicator: each communicator binds to a group and provides the communication context and functions for it

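A sketch of creating sub-communicators with MPI_Comm_split; splitting the world into groups of 4 ranks is just an illustrative choice:

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

MPI_Comm row_comm;
int color = rank / 4;   // ranks with the same color land in the same new communicator
int key   = rank;       // key determines the ordering (new rank) inside it
MPI_Comm_split(MPI_COMM_WORLD, color, key, &row_comm);

// Collectives can now run inside the sub-communicator only.
int row_rank;
MPI_Comm_rank(row_comm, &row_rank);
MPI_Comm_free(&row_comm);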

5. MPI-IO

POSIX (Portable Operating System Interface) I/O vs. MPI-IO:

MPI-IO can:

  • Open the shared file only once, collectively
  • Read/write it from many processes simultaneously

Independent IO (the traditional style)

  • Each process issues its own requests
  • Accesses to the same file from different processes end up serialized

Collective IO

  • A shared read/write buffer combines many small requests into one large one -> fewer IO requests
  • Requires synchronization among the processes

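A sketch contrasting the two styles on a shared file; the file name, offsets, and buffer are illustrative, and a real program would use one style or the other:

MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "data.bin",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);

// Independent IO: each process writes its part on its own.
MPI_File_write_at(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

// Collective IO: all processes call together, so the library can merge the requests.
MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

MPI_File_close(&fh);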

6. Performance

6.1 Performance Model

  • alpha-beta model: Time = alpha (latency) + n * beta (cost per byte) = latency + n / bandwidth

    • For most cases, alpha >> beta >> cost per FLOP

      Sending one large message is cheaper than sending many small ones (a worked example follows this list), e.g.

      alpha + n * beta < n * (alpha + 1 * beta)

    • The computation-to-communication ratio should be kept large, since the cost per FLOP is small compared with communication

  • LogP model: Latency, overhead, gap, and number of Processors
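To make the alpha-beta comparison concrete, with assumed numbers: if alpha = 1 µs and beta = 1 ns/byte, sending one 1 MB message costs about 1 µs + 10^6 ns ≈ 1.0 ms, while sending the same data as 1000 messages of 1 KB costs about 1000 × (1 µs + 10^3 ns) ≈ 2.0 ms, because the per-message latency alpha dominates.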

6.2 Profiling

PMPI: the MPI profiling interface; every MPI_X routine also has a PMPI_X entry point, so tools can interpose wrappers (see the sketch below)
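A minimal sketch of a PMPI interposition layer that counts sends; the counter name and the report format are illustrative:

#include <mpi.h>
#include <stdio.h>

static long send_calls = 0;

// The tool overrides MPI_Send and forwards to the real implementation via PMPI_Send.
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    send_calls++;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

// Report the statistics when the application shuts MPI down.
int MPI_Finalize(void) {
    printf("MPI_Send was called %ld times\n", send_calls);
    return PMPI_Finalize();
}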

mpiP

OSU Benchmark

https://panthema.net/2013/pmbw/index.html#downloads


Last update: March 29, 2022
Authors: Co1lin