Optimizing Bayesian methods for high dimensional data sets on array-based parallel database systems

dc.contributor.advisor: Ordonez, Carlos
dc.contributor.committeeMember: Vilalta, Ricardo
dc.contributor.committeeMember: Baladandayuthapani, Veerabhadran
dc.creator: Cabrera, Wellington 1969-
dc.date.accessioned: 2017-04-09T23:23:06Z
dc.date.available: 2017-04-09T23:23:06Z
dc.date.created: December 2014
dc.date.issued: 2014-12
dc.date.submitted: December 2014
dc.date.updated: 2017-04-09T23:23:07Z
dc.description.abstract: In this work we solve the problem of variable selection for linear regression on large data sets stored in Database Management Systems (DBMS), under the Bayesian approach. This is a challenging problem because data sets with a large number of variables present a combinatorial search space. Moreover, data sets might be so large that they cannot fit in main memory. We introduce a three-step algorithm to solve variable selection for large data sets: pre-selection, summarization, and accelerated Gibbs sampling. Because Markov chain Monte Carlo methods, such as Gibbs sampling, require thousands of iterations for the Markov chain to stabilize, unoptimized algorithms could either run for many hours or fail due to insufficient main memory. We overcome these issues with several non-trivial database-oriented optimizations, which are experimentally validated in both a parallel Array DBMS and a Row DBMS. We highlight the superiority of an Array DBMS, a novel kind of DBMS, for analyzing large matrices. We analyze performance from two perspectives: data set size and dimensionality. Our algorithm is able to identify small subsets of variables that model the dependent variable with reasonable accuracy. We show that our algorithm presents promising performance, especially for high dimensional data sets: our prototype is generally two orders of magnitude faster than a variable selection package for R.
dc.description.department: Computer Science, Department of
dc.format.digitalOrigin: born digital
dc.format.mimetype: application/pdf
dc.identifier.uri: http://hdl.handle.net/10657/1664
dc.language.iso: eng
dc.rights: The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject: Databases
dc.subject: Linear regression
dc.subject: Variable selection
dc.subject: Algorithms
dc.title: Optimizing Bayesian methods for high dimensional data sets on array-based parallel database systems
dc.type.dcmi: Text
dc.type.genre: Thesis
thesis.degree.college: College of Natural Sciences and Mathematics
thesis.degree.department: Computer Science, Department of
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of Houston
thesis.degree.level: Masters
thesis.degree.name: Master of Science

Files

Original bundle

Name: CABRERA-THESIS-2014.pdf
Size: 3.02 MB
Format: Adobe Portable Document Format