Performance models for parallel applications under failures

Date

2017-12

Abstract

As compute clusters grow, large-scale parallel applications increasingly have to cope with hardware malfunctions and other failures during execution. The overall goal of this research is to achieve good performance for parallel applications despite failures. This dissertation introduces two mathematical models to improve the resilience of parallel applications on two different frameworks. The first model minimizes job completion time for inter-dependent parallel processes running in a volunteer environment by finding the optimal checkpoint interval. It is validated with a sample real-world application running on a pool of distributed volunteer nodes. The results show that the predicted checkpoint interval gives performance close to that of the optimal checkpoint interval determined empirically after extensive experimentation.

The second part of the dissertation evaluates the performance of Hadoop MapReduce applications with different execution parameters and under different failure scenarios. It introduces performance models for Hadoop MapReduce applications that account for node and process failures. Such a model makes it possible to determine optimal settings for some parameters, such as split size. The model is validated by running two MapReduce applications with different parameter settings. The results show that different applications require different settings for the same MapReduce parameters and that the proposed model predicts performance well.

Keywords

Performance models, Parallel Execution, Fault tolerance, Volunteer computing, Checkpointing, Replication, Host Selection, MapReduce, Hadoop

Citation

Portions of this document appear in: Rahman, Mohammad Tanvir, Hien Nguyen, Jaspal Subhlok, and Gopal Pandurangan. "Checkpointing to minimize completion time for inter-dependent parallel processes on volunteer grids." In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 331-335. IEEE, 2016. And in: Rahman, Mohammad Tanvir, Edgar Gabriel, and Jaspal Subhlok. "Performance implications of failures on MapReduce applications." In 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 741-748. IEEE, 2017.