On Communication-Computation Overlap in High-Performance Computing






The number of compute nodes and the number of cores per node in high-end computer systems have increased manyfold over the last decade. For a parallel application to scale to tens or even hundreds of thousands of processes, all operations not related to computation, including communication operations, have to be kept to an absolute minimum.

Non-blocking collective operations extend the concept of collective operations by allowing communication to overlap with computation. However, it has been demonstrated that collective operations have to be carefully tuned for a given platform and application scenario to maximize their performance. In addition, using non-blocking collective operations in real-world applications is non-trivial: application codes often have to be restructured significantly to maximize the communication-computation overlap.

The goal of this dissertation is to optimize non-blocking collective communication operations and to facilitate their use at the end-user level. This is achieved through automatic run-time tuning of non-blocking collective communication operations, which allows the communication library to maximize communication-computation overlap and to choose the best-performing implementation of a non-blocking collective operation on a case-by-case basis. Specifically, an approach to maximize the communication-computation overlap for hybrid OpenMP/MPI applications is developed. It leverages automatic parallelization by extending existing concepts to utilize non-blocking collective operations. It also integrates the run-time auto-tuning techniques for non-blocking collective operations, optimizing both the algorithms used for the non-blocking collective operations and the location and frequency of the accompanying progress-function calls.
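
The idea of choosing the best-performing implementation at run time can be illustrated with a minimal sketch. It is not the dissertation's implementation and uses no MPI: the class name `RuntimeSelector`, its parameters, and the candidate functions are hypothetical stand-ins chosen for illustration. The sketch assumes that the same operation is invoked repeatedly (for example, once per iteration of an application loop), so that the first few invocations can be used to time each candidate before the fastest one is fixed.

```python
import time


class RuntimeSelector:
    """Time interchangeable implementations during the first calls,
    then keep using the one with the smallest observed run time.

    Hypothetical illustration only; a real communication library would
    time alternative collective algorithms rather than plain functions.
    """

    def __init__(self, candidates, trials_per_candidate=3,
                 clock=time.perf_counter):
        if not candidates:
            raise ValueError("at least one candidate is required")
        self.candidates = dict(candidates)   # name -> callable
        self.trials = trials_per_candidate   # timed calls per candidate
        self.clock = clock                   # injectable for testing
        self.timings = {name: [] for name in self.candidates}
        self.winner = None                   # set once tuning is finished

    def _next_untested(self):
        # Return the first candidate that still needs timed calls.
        for name, samples in self.timings.items():
            if len(samples) < self.trials:
                return name
        return None

    def __call__(self, *args, **kwargs):
        # After the tuning phase, dispatch directly to the winner.
        if self.winner is not None:
            return self.candidates[self.winner](*args, **kwargs)

        # Tuning phase: run one candidate and record its duration.
        name = self._next_untested()
        start = self.clock()
        result = self.candidates[name](*args, **kwargs)
        self.timings[name].append(self.clock() - start)

        # Once every candidate has enough samples, pick the fastest,
        # comparing the minimum observed time of each candidate.
        if self._next_untested() is None:
            self.winner = min(self.timings,
                              key=lambda n: min(self.timings[n]))
        return result
```

In this sketch, every call still performs the requested operation, so the tuning phase does not discard work; it only varies which candidate carries it out. Comparing minimum times is one possible design choice; a median or mean would be equally plausible and is not something the abstract specifies.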



High performance computing, Message Passing Interface, MPI, Collective Operations, Communication-computation Overlap, Optimization, Distributed memory, Automatic Parallelization


Portions of this document have appeared in: Barigou, Youcef, Vishwanath Venkatesan, and Edgar Gabriel. "Auto-tuning non-blocking collective communication operations." In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International, pp. 1204-1213. IEEE, 2015. DOI: 10.1109/IPDPSW.2015.15.