I saw this tweet from Robota-san about MPI asking how the processes work together, mainly with respect to how their standard output gets aggregated. I was curious about the buffer control involved, so I did some rough digging.
I have not written up the environments in detail this time, so **please verify things yourself rather than taking them at face value**. Also note that the terminology I use is fairly loose and my own.
I tested on Linux, with three MPI implementations: Intel MPI, SGI MPT, and OpenMPI. Fortran is not covered at all; this is about C/C++ only.
First, about **buffering**.
When writing to standard output (stdout in C, std::cout in C++), calling printf or std::ostream::operator<< does not necessarily make the output appear immediately. Issuing the OS APIs that actually perform the output (write, send) in many small pieces is generally bad for performance, so the standard library accumulates data in a buffer to some extent and writes it out in larger chunks. This behavior is called **buffering**.
There are three buffering modes: unbuffered, line buffered, and fully buffered.
For C stdio, buffering is controlled with setbuf / setvbuf. The default depends on the file or device that standard output is connected to: line buffered for a TTY/PTY, fully buffered otherwise. An explicit flush is done with the fflush function.
For C++ iostreams, control comes down to whether the std::ios_base::unitbuf flag is set on std::cout (via std::cout.setf or unsetf). With the flag set the stream is unbuffered; with it unset, fully buffered; there is no line buffered mode. The default is fully buffered. An explicit flush is done with the I/O manipulators std::flush or std::endl.
So the control differs between C and C++, but in mixed C/C++ code a scrambled output order is usually a problem, so the two are synchronized by default. In other words, even if the C++ side is fully buffered, if the C side is not, it gets pulled along with the C side. This behavior can be overridden with std::ios_base::sync_with_stdio(false).
Roughly speaking, MPI is a library plus a set of tools that, given the nodes available and the number of processes to run, launches the same (or even different) programs across multiple nodes and lets them perform cooperative computation.
The launched programs cooperate over various high-speed interconnects such as InfiniBand, but here I only care about the flow of standard output. For that, it is enough to consider the three actors shown in the figure below.
**These are my own terms**, but I will classify the three actors as the front, the manager, and the worker. In the figure above the front and the other actors are drawn as if running on different nodes, but they may be the same node.
**Intel MPI**

First is Intel MPI. It looks like the following figure.
The roles and how they correspond are as follows.
After startup, the manager connects to the front over TCP/IP, aggregates the output piped from the workers, and forwards it to the front.
**SGI MPT**

Next is SGI MPT.
The roles and how they correspond are as follows.
The division of roles is similar to Intel MPI, but launching the manager from the front goes through arrayd, a daemon that ships with SGI MPT (even when the target is the local node).
**OpenMPI**

Finally, OpenMPI.
The roles and how they correspond are as follows.
The big differences from the two MPIs above are the handling of the local node and the fact that the channel between manager and workers is a PTY.
Now let's look at what actually happens when an MPI program produces output and it is finally aggregated and printed at the front.
As organized above, three kinds of programs cooperate when MPI runs: the front, the manager, and the worker. And **worker output is aggregated at the front via the manager**. So buffering needs to be sorted out for each leg of the route: worker to manager, manager to front, and front to the final output destination. Three places in all.
**Intel MPI**

With Intel MPI, the output of the workers gets interleaved even mid-line, so buffering looks disabled.
This is because **the MPI library internally calls setbuf / setvbuf during MPI_Init to put the worker into the unbuffered state**. That is, buffering is disabled on the worker-to-manager leg, and on the manager-to-front and front-to-final-destination legs the data just passes through without any particular control, so the whole pipeline appears unbuffered.
So after MPI_Init you can re-enable buffering by calling setbuf / setvbuf yourself. Also, neither MPI_Init nor MPI::Init seems to touch std::cout's flags, so in a pure C++ application you can get buffering back just by disabling C/C++ synchronization.
Alternatively, with Intel MPI you can pass -ordered-output to mpiexec; then setbuf / setvbuf should not be needed. However, the manual has a note about remembering the final line break in the output, and standard error is affected as well. The following is an excerpt from section 2.3.1 of the Intel MPI 2019 Developer Reference:

> -ordered-output
> Use this option to avoid intermingling of data output from the MPI processes. This option affects both the standard output and the standard error streams.
> NOTE: When using this option, end the last output line of each process with the end-of-line '\n' character. Otherwise the application may stop responding.
**SGI MPT**

With SGI MPT, the output comes out organized line by line, which is equivalent to line buffered behavior.
The mechanism behind this is a bit complicated.
In other words, the buffering happens through the front's own efforts. Conversely, the intent may be that the MPI application (the worker) should not be allowed to control the buffer on its own.
**OpenMPI**

Like SGI MPT, OpenMPI behaves as if line buffered.
The mechanism here is very simple: the channel between worker and manager is a PTY, and stdout connected to a PTY is line buffered by default. For the rest, manager to front and front to final destination, nothing special seems to be done. In other words, OpenMPI itself does not do any buffering control at all; it is left to the standard library.
So we have seen how the control differs across the MPIs. Despite the differences, if you want to make sure buffering actually happens, I think the safest approach is to call setbuf / setvbuf right after MPI_Init.
If line-by-line output is enough, SGI MPT and OpenMPI already behave that way, and Intel MPI has -ordered-output, so for the three MPIs discussed this time there seems to be no need to touch the program side. Below, for reference, are the sources and an operation log from trying out the behavior with Intel MPI.
Operation log
$ cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core)
$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2018 Update 1 Build 20171011 (id: 17941)
Copyright (C) 2003-2017, Intel Corporation. All rights reserved.
$ icpc --version
icpc (ICC) 18.0.1 20171018
Copyright (C) 1985-2017 Intel Corporation. All rights reserved.
$ mpiicpc -std=gnu++11 -o test test.cpp
$ mpirun -np 2 ./test
abababababababababbababababababababababa
abababababababababababababababababababab
a
bbabaababababababababababababababababab
a
bababbaababababbaababababababababababab
babaabababababababababababababababababab
$ mpirun -np 2 ./test --nosync
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
$ mpirun -np 2 ./test --setvbuf
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
$ mpirun -np 2 ./test --nosync --unitbuf
abababbaabababababababababababababababab
a
bababababababababababababababababababab
babababababababababababababababababababa
ababababbabababababababababababababababa
abababababababababababababababababababab
$ mpiicpc -std=gnu++11 -o test2 test2.cpp
$ mpirun -np 2 ./test2
abababababbaababababababbaababababababab
babaabababbaabababababababababababababab
babababababababababababababababababababa
ababababbaabababababbabaabababababababab
a
bababababababababababababababababababab
$ mpirun -np 2 ./test2 -f
aaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbb
$ mpirun -np 2 ./test2 -l
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
$
test.cpp
#include <mpi.h>
#include <iostream>
#include <thread>
#include <chrono>
#include <string>
#include <cstdio>

static char stdoutbuf[8192];

int main(int argc, char **argv) {
    MPI::Init(argc, argv);
    MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_THROW_EXCEPTIONS);
    int rank = MPI::COMM_WORLD.Get_rank();
    for ( int i = 1; i < argc; i++ ) {
        std::string opt(argv[i]);
        if ( opt == "--nosync" ) {
            // detach C++ iostreams from C stdio
            std::ios_base::sync_with_stdio(false);
        }
        else if ( opt == "--setvbuf" ) {
            // re-enable full buffering for C stdio
            std::setvbuf(stdout, stdoutbuf, _IOFBF, sizeof(stdoutbuf));
        }
        else if ( opt == "--unitbuf" ) {
            // disable buffering on C++ iostreams
            std::cout.setf(std::ios_base::unitbuf);
        }
        else if ( rank == 0 ) {
            std::cerr << "invalid option: " << opt << std::endl;
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
    }
    char c = 'a' + rank;
    for ( int i = 0; i < 5; i++ ) {
        MPI::COMM_WORLD.Barrier();
        for ( int j = 0; j < 20; j++ ) {
            std::cout << c;
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
        std::cout << std::endl;
    }
    MPI::Finalize();
}
test2.cpp
#include <mpi.h>
#include <iostream>
#include <thread>
#include <chrono>
#include <string>
#include <cstdio>

static char stdoutbuf[8192];

int main(int argc, char **argv) {
    MPI::Init(argc, argv);
    MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_THROW_EXCEPTIONS);
    int rank = MPI::COMM_WORLD.Get_rank();
    if ( argc > 1 ) {
        std::string opt(argv[1]);
        if ( opt == "-f" ) {
            // fully buffered
            std::setvbuf(stdout, stdoutbuf, _IOFBF, sizeof(stdoutbuf));
        }
        else if ( opt == "-l" ) {
            // line buffered
            std::setvbuf(stdout, stdoutbuf, _IOLBF, sizeof(stdoutbuf));
        }
    }
    char c = 'a' + rank;
    for ( int i = 0; i < 5; i++ ) {
        MPI::COMM_WORLD.Barrier();
        for ( int j = 0; j < 20; j++ ) {
            std::cout << c;
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
        std::cout << '\n';
    }
    std::cout << std::flush;
    MPI::Finalize();
}