Multidimensional Aggregations in Parallel Database Systems

dc.contributor.advisorOrdóñez, Carlos R.
dc.contributor.committeeMemberGnawali, Omprakash
dc.contributor.committeeMemberFu, Xin
dc.contributor.committeeMemberJohnsson, Lennart
dc.contributor.committeeMemberHuang, Stephen
dc.creatorZhang, Yiqun 1991-
dc.creator.orcid0000-0002-5055-5720
dc.date.accessioned2020-01-03T21:47:37Z
dc.date.available2020-01-03T21:47:37Z
dc.date.createdDecember 2017
dc.date.issued2017-12
dc.date.submittedDecember 2017
dc.date.updated2020-01-03T21:47:38Z
dc.description.abstractAggregations compute summaries of a data set and are ubiquitous in big data analytics problems. In this dissertation, we make two major technical contributions to parallel database systems that significantly extend the capabilities of aggregations, studying two complementary multidimensional mathematical structures: cubes and matrices. Cubes present a combinatorial problem over a set of discrete dimensions, widely studied in database systems. Matrices, on the other hand, are widely used in machine learning models computed with iterative numerical methods that take multidimensional vectors as input. Both problems are difficult to solve on large data sets residing on secondary storage, and their algorithms are hard to optimize on a parallel cluster. First, we extend cubes to intuitively show the relationship between measures aggregated at different grouping levels by introducing the percentage cube, a generalized database cube that takes percentages as its basic measure instead of simple sums. We show that the percentage cube is significantly harder to compute than the standard cube due to its higher exponential complexity. We propose SQL syntax and introduce novel query optimizations to materialize the percentage cube without memory limitations. We compare our optimized queries with existing SQL functions, evaluating time, speed-up, and the effectiveness of lattice pruning methods. In addition, we show that columnar storage provides significant acceleration over row storage, the standard storage mechanism. Second, we study parallel aggregation on large matrices stored as tables, and we show how to compute a comprehensive data summarization matrix, called Gamma. Gamma generally fits in main memory and is shown to enable the efficient derivation of many machine learning models.
Specifically, we show that our Gamma summarization matrix benefits linear machine learning models, including PCA, linear regression, classification, and variable selection. We show analytically that our summarization matrix captures essential statistical properties of the data set, and experimentally that Gamma allows iterative algorithms to iterate faster in main memory. We also show that Gamma is further accelerated with array and columnar storage. We experimentally demonstrate that our parallel aggregations allow faster model computation than existing machine learning libraries in R and Spark, two popular analytic platforms.
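The abstract's central idea, a single summarization matrix from which linear models can be derived without rescanning the data, can be illustrated with a small NumPy sketch. This is not the dissertation's parallel DBMS implementation; it only shows, under the common formulation Gamma = Z'Z with Z = [1 | X | y], how the normal equations for linear regression fall out of Gamma's blocks. All function names below are illustrative.

```python
import numpy as np

def gamma_matrix(X, y):
    """One-pass aggregation Gamma = Z'Z with Z = [1 | X | y].
    Gamma holds n, the column sums, and all cross-products,
    so its size depends only on d, not on n."""
    n, d = X.shape
    Z = np.hstack([np.ones((n, 1)), X, y.reshape(-1, 1)])
    return Z.T @ Z  # (d + 2) x (d + 2)

def ols_from_gamma(G):
    """Solve the least-squares normal equations using only Gamma:
    the top-left block is [1|X]'[1|X], the last column (minus its
    last entry) is [1|X]'y. The first coefficient is the intercept."""
    A = G[:-1, :-1]
    b = G[:-1, -1]
    return np.linalg.solve(A, b)

# Toy data with a known linear relationship (no noise).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0

G = gamma_matrix(X, y)
beta = ols_from_gamma(G)  # recovers [3.0, 2.0, -1.0, 0.5]
```

Because Gamma is tiny relative to the data, iterative model fitting can repeatedly read it from main memory, which is the acceleration the abstract describes.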
dc.description.departmentComputer Science, Department of
dc.format.digitalOriginborn digital
dc.format.mimetypeapplication/pdf
dc.identifier.citationPortions of this document appear in: Zhang, Yiqun, Carlos Ordonez, Javier García-García, and Ladjel Bellatreche. "Optimization of Percentage Cube Queries." In EDBT/ICDT Workshops. 2017. And in: Zhang, Yiqun, Carlos Ordonez, and Wellington Cabrera. "Big data analytics integrating a parallel columnar DBMS and the R language." In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 627-630. IEEE, 2016. And in: Ordonez, Carlos, Yiqun Zhang, and Wellington Cabrera. "The Gamma matrix to summarize dense and sparse data sets for big data analytics." IEEE Transactions on Knowledge and Data Engineering 28, no. 7 (2016): 1905-1918.
dc.identifier.urihttps://hdl.handle.net/10657/5686
dc.language.isoeng
dc.rightsThe author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. UH Libraries has secured permission to reproduce any and all previously published materials contained in the work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subjectDatabases
dc.subjectMultidimensional
dc.subjectAggregation
dc.subjectParallel
dc.subjectCube
dc.subjectOLAP
dc.subjectMachine learning
dc.subjectData summarization
dc.subjectPCA
dc.subjectRegression
dc.subjectPercentage
dc.titleMultidimensional Aggregations in Parallel Database Systems
dc.type.dcmiText
dc.type.genreThesis
thesis.degree.collegeCollege of Natural Sciences and Mathematics
thesis.degree.departmentComputer Science, Department of
thesis.degree.disciplineComputer Science
thesis.degree.grantorUniversity of Houston
thesis.degree.levelDoctoral
thesis.degree.nameDoctor of Philosophy

Files

Original bundle (1 file)
- ZHANG-DISSERTATION-2017.pdf (816.14 KB, Adobe Portable Document Format)

License bundle (2 files)
- PROQUEST_LICENSE.txt (4.43 KB, Plain Text)
- LICENSE.txt (1.81 KB, Plain Text)