Browsing by Author "Zhang, Yiqun 1991-"
Now showing 1 - 2 of 2
Item: Multidimensional Aggregations in Parallel Database Systems (2017-12)
Zhang, Yiqun 1991-; Ordóñez, Carlos R.; Gnawali, Omprakash; Fu, Xin; Johnsson, Lennart; Huang, Stephen

Aggregations compute summaries of a data set and are ubiquitous in big data analytics. In this dissertation, we make two major technical contributions to parallel database systems that significantly extend the capabilities of aggregations, studying two complementary multidimensional mathematical structures: cubes and matrices. Cubes present a combinatorial problem over a set of discrete dimensions and are widely studied in database systems. Matrices, on the other hand, are central to machine learning models, whose iterative numerical methods take multidimensional vectors as input. Both problems are difficult to solve on large data sets residing on secondary storage, and their algorithms are difficult to optimize on a parallel cluster. First, we extend cubes to intuitively show the relationship between measures aggregated at different grouping levels by introducing the percentage cube, a generalized database cube that takes percentages as its basic measure instead of simple sums. We show that the percentage cube is significantly harder to compute than the standard cube because of its higher exponential complexity. We propose SQL syntax and introduce novel query optimizations to materialize the percentage cube without memory limitations. We compare our optimized queries with existing SQL functions, evaluating time, speed-up, and the effectiveness of lattice-pruning methods. We also show that columnar storage provides significant acceleration over row storage, the standard storage mechanism. Second, we study parallel aggregation on large matrices stored as tables and show how to compute a comprehensive data summarization matrix, called Gamma. Gamma generally fits in main memory and enables the efficient derivation of many machine learning models.
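The percentage cube idea described above can be illustrated with a minimal sketch: each aggregate is reported as a percentage of its parent grouping level rather than as a raw sum. The table, column names, and values below are invented for this example; the dissertation's actual SQL syntax and query optimizations are not reproduced here.

```python
# Hypothetical two-level example of percentage aggregation:
# each city's sales as a percentage of its state, and each
# state's sales as a percentage of the grand total.
rows = [
    ("TX", "Houston", 30.0),
    ("TX", "Dallas",  10.0),
    ("CA", "LA",      40.0),
    ("CA", "SF",      20.0),
]

# Sum at the coarser grouping level (state).
state_totals = {}
for state, _, sales in rows:
    state_totals[state] = state_totals.get(state, 0.0) + sales

grand_total = sum(state_totals.values())

# Percentages relative to the parent level in the grouping lattice.
city_pct = {(s, c): sales / state_totals[s] * 100 for s, c, sales in rows}
state_pct = {s: t / grand_total * 100 for s, t in state_totals.items()}

print(city_pct[("TX", "Houston")])  # 75.0 (30 of 40 TX sales)
print(state_pct["CA"])              # 60.0 (60 of 100 total sales)
```

A full percentage cube would compute such ratios for every pair of grouping levels in the cube lattice, which is what makes it combinatorially harder than a standard cube of sums.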
Specifically, we show that the Gamma summarization matrix benefits linear machine learning models, including PCA, linear regression, classification, and variable selection. We show analytically that the summarization matrix captures essential statistical properties of the data set, and we show experimentally that Gamma allows iterative algorithms to iterate faster in main memory. We also show that Gamma is further accelerated with array and columnar storage. We experimentally demonstrate that our parallel aggregations enable faster model computation than existing machine learning libraries in R and Spark, two popular analytic platforms.

Item: Scalable Machine Learning Algorithms in Parallel Database Systems Exploiting a Data Summarization Matrix (2016-05)
Zhang, Yiqun 1991-; Ordonez, Carlos; Gnawali, Omprakash; Fu, Xin

Data summarization is an essential mechanism for accelerating analytic algorithms on large data sets. In this work we present a comprehensive data summarization matrix, the Gamma matrix, from which we can derive equivalent equations for many analytic algorithms. In this way, iterative algorithms are restructured to work in two phases: (1) incremental, parallel summarization of the data set in one pass; (2) iteration in main memory, exploiting the summarization matrix in many intermediate computations. We show that the summarization matrix captures essential statistical properties of the data set and allows iterative algorithms to run much faster in main memory. Specifically, we show that the summarization matrix benefits statistical models including PCA, linear regression, and variable selection. From a systems perspective, we carefully study the efficient computation of the summarization matrix in two parallel database systems: the array DBMS SciDB and the columnar relational DBMS HP Vertica. We also propose general optimizations based on data density, as well as system-dependent optimizations for each platform.
We present an experimental evaluation benchmarking system and algorithm performance. Our experiments show that our algorithms run significantly faster than existing machine learning libraries for model computation in R and Spark, and that R working together with SciDB generally runs our algorithms significantly faster than all the other parallel analytic systems compared. More importantly, this combination eliminates the main memory and performance limitations of R.
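Both items center on the Gamma summarization matrix. A minimal sketch follows, assuming one common formulation in which Gamma = Zᵀ·Z for an augmented matrix Z = [1 | X | y]; the data sizes and coefficients below are invented for illustration, and the actual in-DBMS incremental computation is not reproduced here.

```python
# Sketch of a Gamma-style summarization matrix: Gamma = Z^T Z, where Z
# augments the data matrix X with a column of 1s and the target y.
# Gamma then packs n, the column sums, and all cross-products into one
# small (d+2) x (d+2) matrix; it can be accumulated in a single pass
# (summing z_i z_i^T per row), which is what makes it parallelizable.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0 + rng.normal(scale=0.1, size=n)

Z = np.hstack([np.ones((n, 1)), X, y[:, None]])
Gamma = Z.T @ Z  # Gamma[0, 0] == n; Gamma[1:d+1, 1:d+1] == X'X, etc.

# Linear regression derived from Gamma alone, without rescanning the data:
A = Gamma[:d + 1, :d + 1]     # [1|X]' [1|X]
b = Gamma[:d + 1, d + 1]      # [1|X]' y
beta = np.linalg.solve(A, b)  # least-squares coefficients, intercept first
```

Once Gamma is materialized, iterative model fitting (the second phase described above) touches only this small matrix in main memory, which is the source of the speed-ups the abstracts report.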