A Parallel Implementation of the Pandas Framework

dc.contributor.advisor: Gabriel, Edgar
dc.contributor.committeeMember: Solorio, Thamar
dc.contributor.committeeMember: Lindner, Peggy
dc.creator: Khan, Saba Hafeez
dc.date.accessioned: 2020-06-02T04:35:27Z
dc.date.available: 2020-06-02T04:35:27Z
dc.date.created: May 2020
dc.date.issued: 2020-05
dc.date.submitted: May 2020
dc.date.updated: 2020-06-02T04:35:28Z
dc.description.abstract: High performance is a highly desirable trait for applications today. Companies large and small are migrating their serial applications to parallel versions to reduce execution time and increase efficiency. However, preparing serial applications for parallel processing is not a simple process. Pandas, a Python library of rich data structures and tools, is widely used in data science applications. However, the Pandas framework is built for single-core processing and cannot fully utilize multi-core processors or cluster technology. Because of this limitation, Pandas users are forced to look for other frameworks when working with large quantities of data. This thesis introduces the Parallel-Pandas library, which makes parallelizing serial Pandas applications easy and transparent. The library lets Pandas users upgrade existing applications transparently by changing only a library import. The thesis details the design decisions and implementation of the Parallel-Pandas library. The library is evaluated with unit tests, microbenchmarks, and a real-world application on different datasets. Parallel-Pandas is also compared with PySpark, a framework that provides parallelism by following the MapReduce structure. The results presented in this thesis show that the Parallel-Pandas library has promising potential and delivers performance close to that of manually parallelized and tuned applications.
dc.description.department: Computer Science, Department of
dc.format.digitalOrigin: born digital
dc.format.mimetype: application/pdf
dc.identifier.uri: https://hdl.handle.net/10657/6602
dc.language.iso: eng
dc.rights: The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject: High Performance Computing
dc.subject: Pandas
dc.subject: MPI
dc.subject: Data Science
dc.subject: Duplicate Detection
dc.subject: Big Data
dc.subject: Python
dc.subject: Parallel-Pandas
dc.subject: mpi4py
dc.title: A Parallel Implementation of the Pandas Framework
dc.type.dcmi: Text
dc.type.genre: Thesis
thesis.degree.college: College of Natural Sciences and Mathematics
thesis.degree.department: Computer Science, Department of
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of Houston
thesis.degree.level: Masters
thesis.degree.name: Master of Science

Files

Original bundle

Name: KHAN-THESIS-2020.pdf
Size: 905.15 KB
Format: Adobe Portable Document Format

License bundle

Name: PROQUEST_LICENSE.txt
Size: 4.42 KB
Format: Plain Text

Name: LICENSE.txt
Size: 1.81 KB
Format: Plain Text