    MPI Based Python Libraries for Data Science Applications

    RODGERS-THESIS-2020.pdf (4.419Mb)
    Date
    2020-05
    Author
    Rodgers, John Scott
    Abstract
    Tools commonly used for large-scale data science workflows have traditionally shied away from existing high performance computing (HPC) paradigms, largely because those paradigms lack fault tolerance and computational resiliency. However, these concerns are typically critical only for problems tackled by technology companies at the highest level. For the average data scientist, resiliency may matter less than overall execution performance. To this end, this thesis develops prototypes of tools favored by the data science community that run in a data-parallel environment and take advantage of functionality commonly used in HPC. Specifically, a prototype distributed clone of the Python NumPy library and a select module from the SciPy library were developed; both leverage MPI for inter-process communication and data transfers while abstracting the complexity of MPI programming away from their users. Various benchmarks are used to analyze the overhead introduced by the logic needed to operate in a data-parallel environment, as well as the scalability of using parallel compute resources for routines commonly used by the emulated libraries. For the distributed NumPy clone, routines that could act solely on their local array contents incurred minimal overhead, while routines that required a global view of the distributed elements incurred considerable overhead. In terms of scalability, both the distributed NumPy clone and the select SciPy module, a distributed implementation of K-Means clustering, performed reasonably well, while showing notable sensitivity to per-process problem sizes and to operations that required large amounts of collective communication or synchronization. As this work focused mainly on initial exploration and prototyping, the benchmark results can guide future development efforts by identifying operations to refine and optimize.
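The distinction the abstract draws between local-only routines and routines needing a global view can be illustrated with a minimal sketch. This is not the thesis's implementation: it uses no MPI at all and simulates ranks as plain Python lists. The helper names `block_partition`, `local_scale`, and `simulated_allreduce_sum` are hypothetical and chosen only for illustration.

```python
# Illustrative simulation only: "ranks" are list entries, not MPI processes.
# All function names here are hypothetical, not taken from the thesis.

def block_partition(data, n_ranks):
    """Split data into n_ranks contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(len(data), n_ranks)
    blocks, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)
        blocks.append(data[start:start + size])
        start += size
    return blocks

def local_scale(blocks, factor):
    """Element-wise routine: each rank touches only its own block, so no
    communication between ranks is needed."""
    return [[x * factor for x in block] for block in blocks]

def simulated_allreduce_sum(blocks):
    """Global routine: each rank computes a partial sum, then the partials are
    combined and every rank receives the total. In a real MPI program this
    combining step would be a collective operation and is where communication
    and synchronization cost arises."""
    partials = [sum(block) for block in blocks]
    total = sum(partials)
    return [total] * len(blocks)

blocks = block_partition(list(range(10)), 4)
scaled = local_scale(blocks, 2)
totals = simulated_allreduce_sum(blocks)
```

In this sketch, `local_scale` does work proportional to each block and nothing else, whereas `simulated_allreduce_sum` needs a step that involves every rank, which is consistent with the abstract's observation that routines requiring a global view carry more overhead.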
    URI
    https://hdl.handle.net/10657/6601
    Collections
    • Published ETD Collection

    DSpace software copyright © 2002-2016  DuraSpace