Install OpenMPI to /usr/local:
wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz
tar xzf openmpi-3.1.2.tar.gz
cd openmpi-3.1.2
./configure --prefix=/usr/local
make all
sudo make install
Executing mpirun requires setting LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=/usr/local/lib
MPI for Python (mpi4py) provides Python bindings for MPI. See the MPI for Python documentation for details.
Install module:
pip install mpi4py
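To verify the installation, a minimal test script can be launched with mpirun. The file name mpi_hello.py below is only an illustration, not part of this setup:
# mpi_hello.py -- hypothetical verification script
from mpi4py import MPI

comm = MPI.COMM_WORLD
print('Hello from rank', comm.Get_rank(), 'of', comm.Get_size())
Run it on two processes:
mpirun -np 2 python mpi_hello.py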
Download NCCL 2.2.12 for CUDA 9.0 from the NVIDIA developer site and install it to /usr/local:
tar -xvf nccl_2.2.12-1+cuda9.0_x86_64.txz
sudo mkdir /usr/local/nccl-2.2.12
sudo cp -r nccl_2.2.12-1+cuda9.0_x86_64/* /usr/local/nccl-2.2.12
Create a file /etc/ld.so.conf.d/nccl.conf with content:
/usr/local/nccl-2.2.12/lib
Run ldconfig to update the dynamic linker cache:
sudo ldconfig
Create symbolic link for NCCL header file:
sudo ln -s /usr/local/nccl-2.2.12/include/nccl.h /usr/include/nccl.h
Horovod is a distributed training framework for deep learning; see the Horovod docs for details.
For installation on machines with GPUs, read the Horovod GPU page.
Install Horovod with NCCL support:
HOROVOD_NCCL_HOME=/usr/local/nccl-2.2.12 HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
There are a number of things to consider when running Python MPI applications on multiple machines:
- When using Anaconda, the Anaconda bin directory must be in the PATH before the system Python executables
- The MPI bin directory must be in the PATH
- The MPI lib directory must be in the LD_LIBRARY_PATH
- Your Python code must exist on all machines in the same location
MPI uses a non-interactive shell for launching processes on remote machines. The best way to set up the PATH
and LD_LIBRARY_PATH variables is to add them to /etc/environment.
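For example, with the OpenMPI and NCCL locations used in this guide (adjust the paths to your system, and prepend the Anaconda bin directory if you use Anaconda), /etc/environment could contain lines like:
PATH="/usr/local/bin:/usr/bin:/bin"
LD_LIBRARY_PATH="/usr/local/lib:/usr/local/nccl-2.2.12/lib"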
MPI uses ssh to launch processes on remote machines. Set up key-based authentication so that ssh works without passwords between the machines that run MPI processes.
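For example, on each machine (assuming the default key location):
ssh-keygen -t rsa
ssh-copy-id ${REMOTE_SERVER_HOST}
ssh-copy-id appends the local public key to the remote machine's ~/.ssh/authorized_keys.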
The MPI processes on different machines communicate over TCP connections. Make sure there are no firewalls blocking the communication.
You should also disable SSH host key checking by creating a file ~/.ssh/config with the following content:
Host *
   StrictHostKeyChecking no
   UserKnownHostsFile=/dev/null
Running env from one MPI machine on another over ssh is a good test of all these points:
ssh ${REMOTE_SERVER_HOST} env
Open MPI provides several BTL (Byte Transfer Layer) components for moving data between processes; the --mca btl option in the mpirun commands below selects which ones to use (self,tcp in these examples):
- vader BTL is a low-latency, high-bandwidth mechanism for transferring data between two processes via shared memory. This BTL can only be used between processes executing on the same node.
- sm BTL (shared-memory Byte Transfer Layer) is a low-latency, high-bandwidth mechanism for transferring data between two processes via shared memory. This BTL can only be used between processes executing on the same node.
- tcp BTL directs Open MPI to use TCP-based communications over IP interfaces / networks.
Create an MPI hostfile mpi_hosts that lists each host and its number of slots:
${HOSTNAME1} slots=${NB_SLOTS}
${HOSTNAME2} slots=${NB_SLOTS}
...
Point-to-point communication sends data from one process to another.
mpirun -np 2 --hostfile mpi_hosts --mca btl self,tcp python mpi_point_to_point.py
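A minimal sketch of what mpi_point_to_point.py might look like (an assumption for illustration, using mpi4py's buffer-based Send/Recv):
# mpi_point_to_point.py (sketch): rank 0 sends a NumPy buffer to rank 1
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(10, dtype=np.float64)
    comm.Send(data, dest=1, tag=0)             # blocking send of the buffer
    print('Rank: ', rank, ', sent: ', data)
elif rank == 1:
    data = np.empty(10, dtype=np.float64)
    comm.Recv(data, source=0, tag=0)           # blocking receive into the buffer
    print('Rank: ', rank, ', received: ', data)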
Broadcasting takes a variable and sends an exact copy of it to all processes.
mpirun -np 4 --hostfile mpi_hosts --mca btl self,tcp python mpi_broadcast.py
> Rank:  0 , data received:  [0. 0.34888889 0.69777778 1.04666667 1.39555556 1.74444444 2.09333333 2.44222222 2.79111111 3.14 ]
> Rank:  1 , data received:  [0. 0.34888889 0.69777778 1.04666667 1.39555556 1.74444444 2.09333333 2.44222222 2.79111111 3.14 ]
> Rank:  2 , data received:  [0. 0.34888889 0.69777778 1.04666667 1.39555556 1.74444444 2.09333333 2.44222222 2.79111111 3.14 ]
> Rank:  3 , data received:  [0. 0.34888889 0.69777778 1.04666667 1.39555556 1.74444444 2.09333333 2.44222222 2.79111111 3.14 ]
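A minimal sketch of what mpi_broadcast.py might contain (assumed, consistent with the output above):
# mpi_broadcast.py (sketch): rank 0 broadcasts a NumPy array to all ranks
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.linspace(0.0, 3.14, 10)          # root creates the data
else:
    data = np.empty(10, dtype=np.float64)      # other ranks allocate a receive buffer

comm.Bcast(data, root=0)                       # everyone ends up with an identical copy
print('Rank: ', rank, ', data received: ', data)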
Scatter takes an array and distributes contiguous sections of it to different processes.
mpirun -np 4 --hostfile mpi_hosts --mca btl self,tcp python mpi_scatter.py
> Rank:  0 , recvbuf received:  [ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
> Rank:  1 , recvbuf received:  [11. 12. 13. 14. 15. 16. 17. 18. 19. 20.]
> Rank:  2 , recvbuf received:  [21. 22. 23. 24. 25. 26. 27. 28. 29. 30.]
> Rank:  3 , recvbuf received:  [31. 32. 33. 34. 35. 36. 37. 38. 39. 40.]
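A minimal sketch of what mpi_scatter.py might contain (assumed, consistent with the output above):
# mpi_scatter.py (sketch): rank 0 scatters 10-element chunks of a 40-element array
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

sendbuf = None
if rank == 0:
    sendbuf = np.linspace(1.0, 10.0 * size, 10 * size)   # 1..40 for 4 processes

recvbuf = np.empty(10, dtype=np.float64)
comm.Scatter(sendbuf, recvbuf, root=0)         # each rank gets one contiguous chunk
print('Rank: ', rank, ', recvbuf received: ', recvbuf)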
The reverse of a scatter is a gather, which takes subsets of an array that are distributed across the processes, and gathers them back into the full array.
mpirun -np 4 --hostfile mpi_hosts --mca btl self,tcp python mpi_gather.py
> Rank:  0 , sendbuf:  [ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
> Rank:  1 , sendbuf:  [11. 12. 13. 14. 15. 16. 17. 18. 19. 20.]
> Rank:  2 , sendbuf:  [21. 22. 23. 24. 25. 26. 27. 28. 29. 30.]
> Rank:  3 , sendbuf:  [31. 32. 33. 34. 35. 36. 37. 38. 39. 40.]
> Rank:  0 , recvbuf received:  [ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40.]
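A minimal sketch of what mpi_gather.py might contain (assumed, consistent with the output above):
# mpi_gather.py (sketch): every rank sends its chunk, rank 0 reassembles the full array
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

sendbuf = np.linspace(10.0 * rank + 1.0, 10.0 * rank + 10.0, 10)  # e.g. 11..20 on rank 1
print('Rank: ', rank, ', sendbuf: ', sendbuf)

recvbuf = None
if rank == 0:
    recvbuf = np.empty(10 * size, dtype=np.float64)

comm.Gather(sendbuf, recvbuf, root=0)          # concatenated in rank order on the root
if rank == 0:
    print('Rank: ', rank, ', recvbuf received: ', recvbuf)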
The reduce operation takes values in from an array on each process and reduces them to a single result on the root process.
mpirun -np 4 --hostfile mpi_hosts --mca btl self,tcp python mpi_reduce.py
> Rank:  0  value =  0.0
> Rank:  1  value =  1.0
> Rank:  2  value =  2.0
> Rank:  3  value =  3.0
> Rank 0: value_sum = 6.0
> Rank 0: value_max = 3.0
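A minimal sketch of what mpi_reduce.py might contain (assumed, consistent with the output above):
# mpi_reduce.py (sketch): combine one value per rank into a sum and a max on rank 0
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

value = np.array([float(rank)])                # each rank contributes its own rank
print('Rank: ', rank, ' value = ', value[0])

value_sum = np.zeros(1)
value_max = np.zeros(1)
comm.Reduce(value, value_sum, op=MPI.SUM, root=0)   # sum of all values on rank 0
comm.Reduce(value, value_max, op=MPI.MAX, root=0)   # max of all values on rank 0

if rank == 0:
    print('Rank 0: value_sum =', value_sum[0])
    print('Rank 0: value_max =', value_max[0])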
The allreduce operation takes values in from an array on each process, reduces them to a single result and sends the result to each process. Note that the communication pattern is much more complex compared to the reduce operation.
mpirun -np 4 --hostfile mpi_hosts --mca btl self,tcp python mpi_allreduce.py
> Rank  0 value= 0.0
> Rank  1 value= 1.0
> Rank  2 value= 2.0
> Rank  3 value= 3.0
> Rank 0 value_sum= 6.0
> Rank 0 value_max= 3.0
> Rank 1 value_sum= 6.0
> Rank 2 value_sum= 6.0
> Rank 2 value_max= 3.0
> Rank 3 value_sum= 6.0
> Rank 3 value_max= 3.0
> Rank 1 value_max= 3.0
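A minimal sketch of what mpi_allreduce.py might contain (assumed, consistent with the output above):
# mpi_allreduce.py (sketch): like reduce, but every rank receives the result
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

value = np.array([float(rank)])
print('Rank ', rank, 'value=', value[0])

value_sum = np.zeros(1)
value_max = np.zeros(1)
comm.Allreduce(value, value_sum, op=MPI.SUM)   # no root: result is delivered to every rank
comm.Allreduce(value, value_max, op=MPI.MAX)
print('Rank', rank, 'value_sum=', value_sum[0])
print('Rank', rank, 'value_max=', value_max[0])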
Horovod collective operations are launched with mpirun in the same way. Run the Horovod allreduce example on 2 processes:
mpirun -np 2 \
    --hostfile mpi_hosts \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 --mca btl self,tcp \
    python -u hvd_allreduce.py
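A minimal sketch of what hvd_allreduce.py might contain (an assumption, written in the TensorFlow 1.x style of this toolchain):
# hvd_allreduce.py (sketch): average one small tensor per rank across all Horovod processes
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                        # one Horovod process per GPU

value = tf.constant([float(hvd.rank())])          # each rank contributes its own rank
avg = hvd.allreduce(value)                        # averaged across ranks by default

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())  # pin each process to one GPU
with tf.Session(config=config) as sess:
    print('Rank', hvd.rank(), 'allreduce result:', sess.run(avg))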
Run the Horovod broadcast example:
mpirun -np 2 \
    --hostfile mpi_hosts \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 --mca btl self,tcp \
    python -u hvd_broadcast.py
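A minimal sketch of what hvd_broadcast.py might contain (assumed, same TensorFlow 1.x style):
# hvd_broadcast.py (sketch): overwrite every rank's tensor with rank 0's value
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

value = tf.constant([float(hvd.rank())] * 4)      # differs per rank before the broadcast
bcast = hvd.broadcast(value, root_rank=0)         # identical to rank 0's tensor afterwards

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
with tf.Session(config=config) as sess:
    print('Rank', hvd.rank(), 'after broadcast:', sess.run(bcast))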
Run the Horovod allgather example:
mpirun -np 2 \
    --hostfile mpi_hosts \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 --mca btl self,tcp \
    python -u hvd_allgather.py
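A minimal sketch of what hvd_allgather.py might contain (assumed, same TensorFlow 1.x style):
# hvd_allgather.py (sketch): concatenate each rank's tensor along the first dimension
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

value = tf.constant([float(hvd.rank())] * 2)      # 2 elements per rank
gathered = hvd.allgather(value)                   # 2 * number_of_ranks elements on every rank

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
with tf.Session(config=config) as sess:
    print('Rank', hvd.rank(), 'allgather result:', sess.run(gathered))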
To run on a machine with 8 GPUs:
time mpirun -np 8 \
    --hostfile mpi_hosts \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 --mca btl self,tcp \
    python tensorflow_mnist.py