A Parallel Implementation of the Pandas Framework

Date

2020-05

Abstract

High performance is a highly desirable trait for applications today. Companies large and small are migrating their serial applications to parallel versions to reduce execution time and increase efficiency. However, preparing a serial application for parallel processing is not a simple process. Pandas, a Python library that provides rich data structures and tools, is widely used in data science applications. However, the Pandas framework is built for single-core processing and cannot fully utilize multi-core processors or cluster technology. Because of this limitation, Pandas users are forced to turn to other frameworks when working with large quantities of data. This thesis introduces Parallel-Pandas, a library that makes parallelizing serial Pandas applications easy and transparent: users can upgrade an existing application by using only a library import. The thesis details the design decisions and implementation of the library. Parallel-Pandas is evaluated with unit tests, microbenchmarks, and a real-world application run on different datasets. It is also compared with PySpark, a framework that provides parallelism by following the MapReduce model. The results presented in this thesis show that Parallel-Pandas has promising potential and delivers performance close to that of manually parallelized and tuned applications.

Keywords

High Performance Computing, Pandas, MPI, Data Science, Duplicate Detection, Big Data, Python, Parallel-Pandas, mpi4py
