A Checkpointing Restart Approach for OpenSHMEM Fault Tolerance

dc.contributor.advisor: Chapman, Barbara M.
dc.contributor.committeeMember: Shamis, Pavel
dc.contributor.committeeMember: Gabriel, Edgar
dc.creator: Hao, Pengfei 1989-
dc.date.created: May 2016
dc.date.submitted: May 2016
dc.description.abstract: The Partitioned Global Address Space (PGAS) model has emerged recently for parallel programming at large scale. The PGAS ecosystem contains libraries and languages (the latter often implemented atop those libraries). One such library is OpenSHMEM, which offers an intuitive and easy-to-use API. OpenSHMEM's main feature is one-sided communication, in which communication and computation can be overlapped easily. Performing computational science at large scale requires a resilient computing environment. Current computer systems, although generally reliable, do suffer from occasional faults. As the size of leadership high performance computing systems trends towards Exascale, the presence of faults will lead to system failures that cause fatal software failures. Mitigating this problem requires software resilience, or "fault tolerance." One common approach is to checkpoint and restart from a known good state when an error is detected. A long-running (e.g., weeks or months) program without fault tolerance will suffer from failure-restart cycles, which introduce unacceptably lengthy, uncertain execution times and hugely increased resource usage. In this thesis work, we explore a fault tolerance scheme based on checkpoint and restart that is specialized for the needs of PGAS programming models, using OpenSHMEM as a concrete implementation. Using a 1-D Jacobi code, we show that this kind of approach is scalable and can save considerable resource usage. Ideas for more general solutions and other approaches are presented as future work.
dc.description.department: Computer Science, Department of
dc.format.digitalOrigin: born digital
dc.identifier.citation: Portions of this document have appeared in: Hao, Pengfei, Pavel Shamis, Manjunath Gorentla Venkata, Swaroop Pophale, Aaron Welch, Stephen Poole, and Barbara Chapman. "Fault tolerance for OpenSHMEM." In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, p. 23. ACM, 2014. And in: Hao, Pengfei, Swaroop Pophale, Pavel Shamis, Tony Curtis, and Barbara Chapman. "Check-pointing approach for fault tolerance in OpenSHMEM." In Workshop on OpenSHMEM and Related Technologies, pp. 36-52. Springer, Cham, 2014.
dc.rights: The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. UH Libraries has secured permission to reproduce any and all previously published materials contained in the work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject: Fault tolerance
dc.title: A Checkpointing Restart Approach for OpenSHMEM Fault Tolerance
thesis.degree.college: College of Natural Sciences and Mathematics
thesis.degree.department: Computer Science, Department of
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of Houston
thesis.degree.name: Master of Science