Scalable Machine Learning Algorithms in Parallel Database Systems Exploiting A Data Summarization Matrix

dc.contributor.advisorOrdonez, Carlos
dc.contributor.committeeMemberGnawali, Omprakash
dc.contributor.committeeMemberFu, Xin
dc.creatorZhang, Yiqun 1991- 2016
dc.description.abstractData summarization is an essential mechanism for accelerating analytic algorithms on large data sets. In this work we present a comprehensive data summarization matrix, the Gamma matrix, from which we can derive equivalent equations for many analytic algorithms. Iterative algorithms are thereby restructured to work in two phases: (1) incremental and parallel summarization of the data set in one pass; (2) iteration in main memory, exploiting the summarization matrix in many intermediate computations. We show that our summarization matrix captures essential statistical properties of the data set and allows iterative algorithms to run much faster in main memory. Specifically, we show that our summarization matrix benefits statistical models including PCA, linear regression, and variable selection. From a system perspective, we carefully study the efficient computation of the summarization matrix in two parallel database systems: the array DBMS SciDB and the columnar relational DBMS HP Vertica. We also propose general optimizations based on data density, as well as system-dependent optimizations for each platform. We present an experimental evaluation benchmarking system and algorithm performance. Our experiments show that our algorithms run significantly faster than existing machine learning libraries for model computation in R and Spark, and that R working together with SciDB generally runs our algorithms significantly faster than all the other parallel analytic systems compared. More importantly, this combination eliminates the main memory and performance limitations of R.
dc.description.departmentComputer Science, Department of
dc.format.digitalOriginborn digital
dc.rightsThe author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subjectData summarization
dc.subjectMatrix multiplication
dc.subjectMachine learning
dc.subjectBig data analytics
dc.titleScalable Machine Learning Algorithms in Parallel Database Systems Exploiting A Data Summarization Matrix
dc.type.genreThesis of Natural Sciences and Mathematics Science, Department of Science of Houston of Science
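
The two-phase scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not define the Gamma matrix, so the sketch assumes one common formulation: each record is augmented as z = (1, x_1, ..., x_d, y) and Gamma is the sum of the outer products z z^T. The function names (summarize, merge, solve, linear_regression) and the toy data are hypothetical and are not taken from the thesis; in the thesis the summarization runs inside SciDB or HP Vertica, not in plain Python.

```python
from typing import Iterable, List, Sequence

Matrix = List[List[float]]


def summarize(rows: Iterable[Sequence[float]], d: int) -> Matrix:
    """Phase 1: one pass over (x_1..x_d, y) rows, accumulating sum of z z^T."""
    k = d + 2  # assumed layout: z = (1, x_1..x_d, y)
    gamma = [[0.0] * k for _ in range(k)]
    for row in rows:
        z = [1.0, *row]
        for a in range(k):
            for b in range(k):
                gamma[a][b] += z[a] * z[b]
    return gamma


def merge(g1: Matrix, g2: Matrix) -> Matrix:
    """Partial summaries from separate partitions combine by plain addition."""
    return [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(g1, g2)]


def solve(a: Matrix, b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def linear_regression(gamma: Matrix, d: int) -> List[float]:
    """Phase 2: least-squares coefficients computed from the summary alone."""
    a = [row[: d + 1] for row in gamma[: d + 1]]      # sum of [1,x][1,x]^T
    b = [gamma[i][d + 1] for i in range(d + 1)]       # sum of [1,x] * y
    return solve(a, b)


# Toy data generated from y = 2 + 3x, split into two "partitions".
part1 = [(0.0, 2.0), (1.0, 5.0)]
part2 = [(2.0, 8.0), (3.0, 11.0)]
gamma = merge(summarize(part1, d=1), summarize(part2, d=1))
beta = linear_regression(gamma, d=1)  # [intercept, slope]
```

Under this assumed layout the summary is a (d+2) x (d+2) matrix regardless of the number of rows, which is why the model-fitting step can run in main memory without rereading the data set.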
