A Checkpointing Restart Approach for OpenSHMEM Fault Tolerance
dc.contributor.advisor | Chapman, Barbara M. | |
dc.contributor.committeeMember | Shamis, Pavel | |
dc.contributor.committeeMember | Gabriel, Edgar | |
dc.creator | Hao, Pengfei 1989- | |
dc.date.accessioned | 2018-07-17T16:48:39Z | |
dc.date.available | 2018-07-17T16:48:39Z | |
dc.date.created | May 2016 | |
dc.date.issued | 2016-05 | |
dc.date.submitted | May 2016 | |
dc.date.updated | 2018-07-17T16:48:39Z | |
dc.description.abstract | The Partitioned Global Address Space (PGAS) model has recently emerged for parallel programming at large scale. The PGAS ecosystem contains libraries and languages (the latter often implemented atop those libraries). One such library is OpenSHMEM, which offers an intuitive and easy-to-use API. OpenSHMEM's main feature is one-sided communication, in which communication and computation can be overlapped easily. Performing computational science at large scale requires a resilient computing environment. Current computer systems, although generally reliable, do suffer from occasional faults. As the size of leadership high performance computing systems trends towards Exascale, the presence of faults will lead to system failures that cause fatal software failures. Mitigating this problem requires software resilience, or "fault tolerance." One common approach is to checkpoint and restart from a known good state when an error is detected. A long-running (e.g., weeks or months) program without fault tolerance will suffer from failure-restart cycles, which introduce unacceptably long and uncertain execution times and greatly increased resource usage. In this thesis work, we explore a fault tolerance scheme based on checkpoint and restart that is specialized for the needs of PGAS programming models, using OpenSHMEM as a concrete implementation. Using a 1-D Jacobi code, we show that this kind of approach is scalable and can considerably reduce resource usage. Ideas for more general solutions and other approaches are presented as future work. | |
dc.description.department | Computer Science, Department of | |
dc.format.digitalOrigin | born digital | |
dc.format.mimetype | application/pdf | |
dc.identifier.citation | Portions of this document have appeared in: Hao, Pengfei, Pavel Shamis, Manjunath Gorentla Venkata, Swaroop Pophale, Aaron Welch, Stephen Poole, and Barbara Chapman. "Fault tolerance for OpenSHMEM." In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, p. 23. ACM, 2014. And in: Hao, Pengfei, Swaroop Pophale, Pavel Shamis, Tony Curtis, and Barbara Chapman. "Check-pointing approach for fault tolerance in OpenSHMEM." In Workshop on OpenSHMEM and Related Technologies, pp. 36-52. Springer, Cham, 2014. | |
dc.identifier.uri | http://hdl.handle.net/10657/3272 | |
dc.language.iso | eng | |
dc.rights | The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. UH Libraries has secured permission to reproduce any and all previously published materials contained in the work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s). | |
dc.subject | OpenSHMEM | |
dc.subject | Fault tolerance | |
dc.subject | PGAS | |
dc.subject | One-sided | |
dc.title | A Checkpointing Restart Approach for OpenSHMEM Fault Tolerance | |
dc.type.dcmi | Text | |
dc.type.genre | Thesis | |
thesis.degree.college | College of Natural Sciences and Mathematics | |
thesis.degree.department | Computer Science, Department of | |
thesis.degree.discipline | Computer Science | |
thesis.degree.grantor | University of Houston | |
thesis.degree.level | Masters | |
thesis.degree.name | Master of Science |