Scalable Machine Learning Algorithms in Parallel Database Systems Exploiting A Data Summarization Matrix

Zhang, Yiqun 1991-

Files

ZHANG-THESIS-2016.pdf (232.16 KB)

Date

2016-05

Authors

Zhang, Yiqun 1991-

Abstract

Data summarization is an essential mechanism for accelerating analytic algorithms on large data sets. In this work we present a comprehensive data summarization matrix, the Gamma matrix, from which we can derive equivalent equations for many analytic algorithms. Iterative algorithms are thereby changed to work in two phases: (1) incremental and parallel summarization of the data set in one pass; (2) iteration in main memory, exploiting the summarization matrix in many intermediate computations. We show that our summarization matrix captures essential statistical properties of the data set and allows iterative algorithms to run much faster in main memory. Specifically, we show that it benefits statistical models including PCA, linear regression, and variable selection. From a systems perspective, we carefully study the efficient computation of the summarization matrix in two parallel database systems: the array DBMS SciDB and the columnar relational DBMS HP Vertica. We also propose general optimizations based on data density, as well as system-dependent optimizations for each platform. We present an experimental evaluation benchmarking system and algorithm performance. Our experiments show that our algorithms run significantly faster than existing machine learning libraries for model computation in R and Spark, and that R working together with SciDB can, in general, run our algorithm significantly faster than all the other parallel analytic systems compared. More importantly, this combination eliminates the main memory and performance limitations of R.
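The two-phase scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not define the Gamma matrix, so the layout used here is an assumption: each data point is augmented as z = [1, x_1, ..., x_d, y], and the summary is the sum of the outer products z z^T. All function names (`summarize`, `merge`, `solve`, `regression_from_gamma`) and the toy data are hypothetical and are not taken from the thesis. Phase 1 summarizes two chunks independently and adds the partial results; phase 2 fits a linear regression from the summary alone, without rereading the data.

```python
# Sketch only: the Gamma layout (z = [1, x, y], Gamma = sum of z z^T) is an
# assumption for illustration, not a definition quoted from the thesis.

def summarize(rows, d):
    """Phase 1: one pass over a chunk of (x_1..x_d, y) rows."""
    k = d + 2
    g = [[0.0] * k for _ in range(k)]
    for row in rows:
        z = [1.0, *row]                      # augmented point [1, x, y]
        for i in range(k):
            for j in range(k):
                g[i][j] += z[i] * z[j]       # accumulate outer product
    return g

def merge(a, b):
    """Partial summaries from parallel workers combine by addition."""
    k = len(a)
    return [[a[i][j] + b[i][j] for j in range(k)] for i in range(k)]

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def regression_from_gamma(g):
    """Phase 2: normal equations read directly from the summary."""
    p = len(g) - 1                           # intercept plus d features
    lhs = [row[:p] for row in g[:p]]         # sums of [1, x][1, x]^T
    rhs = [g[i][p] for i in range(p)]        # sums of [1, x] * y
    return solve(lhs, rhs)

# Toy data generated from y = 1 + 2*x1 + 3*x2, split into two chunks.
chunk_a = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0), (0.0, 1.0, 4.0)]
chunk_b = [(1.0, 1.0, 6.0), (2.0, 1.0, 8.0)]

gamma = merge(summarize(chunk_a, 2), summarize(chunk_b, 2))
beta = regression_from_gamma(gamma)          # [intercept, coef_x1, coef_x2]
```

In this sketch the summary is a (d+2) x (d+2) matrix regardless of the number of rows, which is why the second phase can run in main memory. The dense nested loops are chosen for readability; they do not reflect the density-based or system-specific optimizations mentioned in the abstract.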

Keywords

Data summarization, Array, DBMS, Matrix multiplication, Machine learning, Big data analytics

URI

http://hdl.handle.net/10657/1482

Collections

Published ETD Collection
