A Checkpointing Restart Approach for OpenSHMEM Fault Tolerance

Date

2016-05

Abstract

The Partitioned Global Address Space (PGAS) model has recently emerged as an approach to parallel programming at large scale. The PGAS ecosystem comprises libraries and languages, the latter often implemented atop those libraries. One such library is OpenSHMEM, which offers an intuitive and easy-to-use API. OpenSHMEM's main feature is one-sided communication, in which communication and computation can be overlapped easily.

Performing computational science at large scale requires a resilient computing environment. Current computer systems, although generally reliable, do suffer from occasional faults. As leadership high-performance computing systems grow toward exascale, faults will lead to system failures that in turn cause fatal software failures. Mitigating this problem requires software resilience, or "fault tolerance." One common approach is to checkpoint the program and restart it from a known good state when an error is detected.

A long-running program (e.g., one that runs for weeks or months) without fault tolerance will suffer from failure-restart cycles, which introduce unacceptably long and uncertain execution times and greatly increased resource usage. In this thesis, we explore a fault tolerance scheme based on checkpoint and restart that is specialized for the needs of PGAS programming models, using OpenSHMEM as a concrete implementation. Using a 1-D Jacobi code, we show that this kind of approach is scalable and can save considerable resource usage. Ideas for more general solutions and other approaches are presented as future work.
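
The general checkpoint-and-restart idea on a 1-D Jacobi iteration can be illustrated with a minimal, single-process sketch. This is not the OpenSHMEM scheme developed in the thesis: it uses no one-sided communication, stores the checkpoint in a local file, and simulates a fault with an exception. All function names and parameters (`jacobi_step`, `save_checkpoint`, `load_checkpoint`, `run`, `fail_at`) are hypothetical and chosen only for illustration.

```python
import os
import pickle
import tempfile


def jacobi_step(u):
    """One Jacobi sweep for the 1-D Laplace equation; end values stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def save_checkpoint(path, step, u):
    """Write the state to a temporary file, then rename it into place.

    The rename keeps the previous checkpoint intact if a fault occurs
    while the new one is being written.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump((step, u), f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the (step, u) pair stored in the checkpoint file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run(u0, total_steps, ckpt_path, interval, fail_at=None):
    """Run Jacobi sweeps, resuming from ckpt_path if a checkpoint exists.

    fail_at is an illustrative assumption: it simulates a fault by raising
    RuntimeError when the step counter reaches that value.
    """
    if os.path.exists(ckpt_path):
        step, u = load_checkpoint(ckpt_path)  # restart from known good state
    else:
        step, u = 0, list(u0)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated fault")
        u = jacobi_step(u)
        step += 1
        if step % interval == 0:
            save_checkpoint(ckpt_path, step, u)
    return u


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "jacobi.ckpt")
    u0 = [1.0] + [0.0] * 9
    try:
        run(u0, 20, path, interval=5, fail_at=7)
    except RuntimeError:
        print("fault at step 7; last checkpoint at step", load_checkpoint(path)[0])
    result = run(u0, 20, path, interval=5)  # resumes from the checkpoint
    print("finished with", len(result), "grid points")
```

In this sketch a restart repeats only the sweeps performed since the last checkpoint rather than the whole run, which is the source of the resource savings that checkpointing aims for. The checkpoint interval trades time spent writing checkpoints against work lost per fault.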

Keywords

OpenSHMEM, Fault tolerance, PGAS, One-sided

Citation

Portions of this document have appeared in: Hao, Pengfei, Pavel Shamis, Manjunath Gorentla Venkata, Swaroop Pophale, Aaron Welch, Stephen Poole, and Barbara Chapman. "Fault tolerance for openshmem." In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, p. 23. ACM, 2014. And in: Hao, Pengfei, Swaroop Pophale, Pavel Shamis, Tony Curtis, and Barbara Chapman. "Check-pointing approach for fault tolerance in OpenSHMEM." In Workshop on OpenSHMEM and Related Technologies, pp. 36-52. Springer, Cham, 2014.