MPI Based Python Libraries for Data Science Applications
Rodgers, John Scott
Tools commonly used for large-scale data science workflows have traditionally shied away from existing high performance computing paradigms, largely because those paradigms lack fault tolerance and computational resiliency. However, these concerns are typically of critical importance only for problems tackled by technology companies at the largest scale. For the average data scientist, resiliency may matter less than overall execution performance. To this end, this thesis aims to develop prototypes of tools favored by the data science community that function in a data-parallel environment and take advantage of functionality commonly used in high performance computing. To achieve this goal, a prototype distributed clone of the Python NumPy library and a select module from the SciPy library were developed. Both leverage MPI for inter-process communication and data transfers while abstracting the complexity of MPI programming away from their users. Various benchmarks are used to analyze the overhead introduced by the logic needed to operate in a data-parallel environment, as well as the scalability of using parallel compute resources for routines commonly used in the emulated libraries. For the distributed NumPy clone, routines that could act solely on their local array contents incurred minimal overhead, while routines that required a global view of the distributed elements incurred considerable overhead. In terms of scalability, both the distributed NumPy clone and the selected SciPy module, a distributed implementation of K-Means clustering, performed reasonably well, while showing notable sensitivity to per-process problem size and to operations that required large amounts of collective communication or synchronization.
As this work mainly focused on the initial exploration and prototyping of behavior, the results of the benchmarks can be used in future development efforts to target operations for refinement and optimization.
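The distinction the abstract draws between routines that act only on local array contents and routines that need a global view can be illustrated with a small sketch. This is not the thesis's API: the function names (`split_blocks`, `local_scale`, `global_mean`) are hypothetical, and the "ranks" are simulated inside a single process using plain Python lists rather than MPI. In an actual MPI program, the combine step in `global_mean` would be a collective operation such as an all-reduce.

```python
# Illustrative sketch only: ranks are simulated in one process, no MPI is used.
# All function names here are hypothetical and not taken from the thesis.

def split_blocks(data, n_ranks):
    """Split data into n_ranks contiguous blocks, as evenly as possible."""
    base, extra = divmod(len(data), n_ranks)
    blocks, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)
        blocks.append(data[start:start + size])
        start += size
    return blocks

def local_scale(block, factor):
    """Element-wise routine: touches only the local block, no communication."""
    return [factor * x for x in block]

def global_mean(blocks):
    """Reduction: each rank contributes a partial (sum, count) pair.

    Combining the partials stands in for a collective such as an
    MPI all-reduce, which is where communication cost would appear.
    """
    partials = [(sum(b), len(b)) for b in blocks]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

data = [float(i) for i in range(10)]
blocks = split_blocks(data, 3)                   # one block per simulated rank
scaled = [local_scale(b, 2.0) for b in blocks]   # purely local work
mean = global_mean(blocks)                       # requires combining all ranks
```

The element-wise call never looks outside its own block, whereas the mean cannot be finished until every rank's partial result has been combined, which is consistent with the abstract's observation that global-scope routines carry more overhead.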