Browsing by Author "Cabrera, Wellington 1969-"
Now showing 1 - 2 of 2
Item
OPTIMIZED ALGORITHMS FOR DATA ANALYSIS IN PARALLEL DATABASE SYSTEMS (2017-05)
Cabrera, Wellington 1969-; Ordonez, Carlos; Gabriel, Edgar; Gnawali, Omprakash; Han, Zhu

Large data sets are generally stored on disk as rows, columns, or arrays, with row storage being the most common. Matrix multiplication, meanwhile, appears throughout machine learning algorithms as an important primitive operation. Since database management systems do not support matrix operations, analytical tasks are commonly performed outside the database system, in external libraries or mathematical tools. In this work, we optimize several analytic algorithms that benefit from fast in-database matrix multiplication. Specifically, we study how to compute parallel matrix multiplication inside the database to solve two major families of big data analytics problems: machine learning models and graph algorithms. We focus on three cases: the product of a matrix with its transpose, the powers of a square matrix, and iterated matrix-vector multiplication. On this foundation, we introduce important optimizations for computing fundamental linear models in machine learning: linear regression, variable selection, and principal component analysis. We also present parallel graph algorithms that exploit matrix powers and parallel matrix-vector multiplication to solve several graph problems: transitive closure, all-pairs shortest paths, reachability from a single source vertex, single-source shortest paths, connected components, and PageRank.

Item
Optimizing Bayesian methods for high dimensional data sets on array-based parallel database systems (2014-12)
Cabrera, Wellington 1969-; Ordonez, Carlos; Vilalta, Ricardo; Baladandayuthapani, Veerabhadran

In this work we solve the problem of variable selection for linear regression on large data sets stored in Database Management Systems (DBMS), under the Bayesian approach. This is a challenging problem because data sets with many variables present a combinatorial search space, and the data may be too large to fit in main memory. We introduce a three-step algorithm to solve variable selection for large data sets: pre-selection, summarization, and accelerated Gibbs sampling. Because Markov chain Monte Carlo methods such as Gibbs sampling require thousands of iterations for the Markov chain to stabilize, unoptimized algorithms can either run for many hours or fail due to insufficient main memory. We overcome these issues with several non-trivial database-oriented optimizations, which are experimentally validated in both a parallel array DBMS and a row DBMS. We highlight the superiority of the array DBMS, a novel kind of DBMS, for analyzing large matrices. We analyze performance from two perspectives: data set size and dimensionality. Our algorithm identifies small subsets of variables that model the dependent variable with reasonable accuracy, and it shows promising performance, especially for high-dimensional data sets: our prototype is generally two orders of magnitude faster than a variable selection package for R.
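The first abstract's central primitive, the product of a matrix with its transpose, is what makes linear models computable in one pass: the summary matrix Gamma = Z^T Z over the augmented rows z = [1, x, y] is additive across partitions, so each parallel worker can aggregate its own chunk and a single reduction combines them. A minimal NumPy sketch of this idea (the data, partition count, and variable names are illustrative, not taken from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Augment each row as z = [1, x, y]; Gamma = Z^T Z then holds n, the column
# sums, X^T X, and X^T y in one (d+2) x (d+2) matrix.
Z = np.hstack([np.ones((n, 1)), X, y[:, None]])

# Simulate parallel workers: each partition computes a local Gamma, and a
# single reduction sums them -- no raw data is shipped between workers.
Gamma = sum(Zp.T @ Zp for Zp in np.array_split(Z, 4))

XtX = Gamma[1:d + 1, 1:d + 1]   # block X^T X
Xty = Gamma[1:d + 1, -1]        # block X^T y
beta = np.linalg.solve(XtX, Xty)
print(np.round(beta, 2))        # close to [2, -1, 0.5]
```

Because Gamma is tiny compared with the data, the expensive scan runs once inside the database and the model itself is solved from the summary.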
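The graph algorithms in the first abstract rest on iterated matrix-vector multiplication: one iteration is a join of the edge table with a vector table followed by a grouped sum, which is exactly what a relational engine parallelizes well. A small PageRank sketch in that style (the toy graph and damping factor are illustrative, not from the thesis):

```python
import numpy as np

# Toy directed graph as an edge list (i -> j), the natural relational layout.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
n = 4

# Column-stochastic transition matrix: each column splits weight over out-edges.
out_deg = np.zeros(n)
for i, _ in edges:
    out_deg[i] += 1
M = np.zeros((n, n))
for i, j in edges:
    M[j, i] = 1.0 / out_deg[i]

# Power iteration: p <- d*(M p) + (1-d)/n. Each step is one matrix-vector
# product; in SQL it is edge-table JOIN vector-table, then SUM ... GROUP BY.
d, p = 0.85, np.full(n, 1.0 / n)
for _ in range(100):
    p = d * (M @ p) + (1 - d) / n
print(np.round(p, 3))   # vertex 2 collects the most links, so it ranks highest
```

Single-source shortest paths and connected components follow the same loop shape, with (min, +) or (min, min) replacing the (+, *) of the product.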
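The second abstract's accelerated Gibbs sampler becomes feasible for the same reason: once X^T X, X^T y, and y^T y are summarized in one pass, every one of the thousands of Gibbs iterations works only on the small summary and never rereads the data. A simplified sketch under a Zellner g-prior, which is a standard choice for illustration and not necessarily the prior used in the thesis (data, prior, and chain lengths are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 8
X = rng.normal(size=(n, d))
true = np.zeros(d)
true[[0, 3]] = [3.0, -2.0]          # only variables 0 and 3 matter
y = X @ true + rng.normal(size=n)

# One pass over the data builds the sufficient statistics (summarization);
# the sampler below never touches X or y again.
XtX, Xty, yty = X.T @ X, X.T @ y, y @ y
g = float(n)                         # g-prior scale

def log_marglik(sel):
    # Log marginal likelihood of the model using only the summary matrices.
    k = sel.sum()
    if k == 0:
        return -0.5 * n * np.log(yty)
    A, b = XtX[np.ix_(sel, sel)], Xty[sel]
    rss = yty - (g / (1 + g)) * b @ np.linalg.solve(A, b)
    return -0.5 * k * np.log(1 + g) - 0.5 * n * np.log(rss)

sel = np.zeros(d, dtype=bool)        # current inclusion indicators
counts = np.zeros(d)
for it in range(200):
    for j in range(d):               # Gibbs step per indicator
        sel[j] = True
        l1 = log_marglik(sel)
        sel[j] = False
        l0 = log_marglik(sel)
        p1 = 1.0 / (1.0 + np.exp(min(l0 - l1, 50.0)))
        sel[j] = rng.random() < p1
    if it >= 50:                     # discard burn-in
        counts += sel
print(np.round(counts / 150, 2))     # posterior inclusion probabilities
```

The chain concentrates on the two informative variables, illustrating how the sampler can identify a small subset of predictors while each iteration costs only small linear-algebra operations on the summary.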