A Fast Clustering Algorithm Merging The Expectation Maximization Algorithm and Markov Chain Monte Carlo

dc.contributor.advisorOrdonez, Carlos
dc.contributor.committeeMemberEick, Christoph F.
dc.contributor.committeeMemberAzencott, Robert
dc.creatorMatusevich, David Sergio 1969-
dc.creator.orcid0000-0001-7494-095X
dc.date.accessioned2017-07-24T18:21:36Z
dc.date.available2017-07-24T18:21:36Z
dc.date.createdMay 2015
dc.date.issued2015-05
dc.date.submittedMay 2015
dc.date.updated2017-07-24T18:21:37Z
dc.description.abstractClustering is an important problem in Statistics and Machine Learning that is usually solved using Likelihood Maximization methods, of which the Expectation-Maximization algorithm (EM) is the most common. In this work we present an algorithm merging Markov Chain Monte Carlo methods with the EM algorithm to find qualitatively better solutions for the clustering problem. We present brief introductions to two popular clustering algorithms, K-Means and EM, as well as the Markov Chain Monte Carlo algorithm. We show how these algorithms can be combined and incorporated into a Database Management System (DBMS) using a combination of SQL queries and User Defined Functions (UDFs). Even though SQL is not optimized for complex calculations, as it is constrained to work on tables and columns, it is unparalleled in handling all aspects of storage management, security of the information, fault management, etc. Our algorithm makes use of these characteristics to produce portable solutions that are comparable to the results obtained by other algorithms and are more efficient since the calculations are all performed inside the DBMS. To simplify the calculation we use very simple scalar UDFs, of a type that is available in most DBMS. The solution has linear time complexity on the size of the data set and it has a linear speedup with the number of servers in the cluster. This was achieved using sufficient statistics and a simplified model that assigns the data-points to different clusters during the E-step in an incremental manner and the introduction of a Sampling step in order to explore the solution space in a more efficient manner. Preliminary experiments show very good agreement with standard solutions.
dc.description.departmentComputer Science, Department of
dc.format.digitalOriginborn digital
dc.format.mimetypeapplication/pdf
dc.identifier.citationPortions of this document appear in: Matusevich, David, and Carlos Ordonez. "A clustering algorithm merging MCMC and EM methods using SQL queries." In Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, pp. 61-76. 2014. http://proceedings.mlr.press/v36/matusevich14.pdf
dc.identifier.urihttp://hdl.handle.net/10657/1935
dc.language.isoeng
dc.rightsThe author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. UH Libraries has secured permission to reproduce any and all previously published materials contained in the work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subjectData mining
dc.subjectClustering
dc.subjectExpectation maximization
dc.subjectBayesian Methods
dc.subjectMarkov Chain Monte Carlo
dc.titleA Fast Clustering Algorithm Merging The Expectation Maximization Algorithm and Markov Chain Monte Carlo
dc.type.dcmitext
dc.type.genreThesis
thesis.degree.collegeCollege of Natural Sciences and Mathematics
thesis.degree.departmentComputer Science, Department of
thesis.degree.disciplineComputer Science
thesis.degree.grantorUniversity of Houston
thesis.degree.levelMasters
thesis.degree.nameMaster of Science

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
MATUSEVICH-THESIS-2015.pdf
Size:
741.24 KB
Format:
Adobe Portable Document Format

License bundle

Now showing 1 - 1 of 1
No Thumbnail Available
Name:
LICENSE.txt
Size:
1.82 KB
Format:
Plain Text
Description: