A General Summarization Matrix for Scalable Machine Learning Model Computation in the R Language

dc.contributor.advisor: Ordonez, Carlos
dc.contributor.committeeMember: Eick, Christoph F.
dc.contributor.committeeMember: Kaiser, Klaus
dc.creator: Chebolu, Siva Uday Sampreeth 1995-
dc.date.accessioned: 2019-09-13T01:18:08Z
dc.date.available: 2019-09-13T01:18:08Z
dc.date.created: May 2019
dc.date.issued: 2019-05
dc.date.submitted: May 2019
dc.date.updated: 2019-09-13T01:18:09Z
dc.description.abstract: Data analysis is an essential task for research. Modern datasets are often large enough that analyzing them requires a parallel DBMS, the Hadoop stack, or a parallel cluster. We propose an alternative: using a lightweight language and system, R, to compute machine learning models on such datasets. This approach eliminates the need for cluster or parallel systems in most cases, and thus makes such analysis accessible to an average user. Specifically, we aim to remove the physical-memory and running-time limitations that R packages currently face on a single machine. R is a powerful language and is very popular for data analysis. However, R is slow and does not lend itself to flexible modification, and making it faster and more efficient is cumbersome. To address these drawbacks, we implemented our approach in two phases. The first phase constructs a summarization matrix, Γ, in a single scan of the source dataset; it is implemented in C++ using the Rcpp package. The Γ matrix has two forms, Diagonal and Non-Diagonal Gamma, each of which is efficient for computing specific models. The second phase, implemented in R, uses the constructed Γ matrix to compute machine learning models such as PCA, linear regression, Naïve Bayes, and K-means. We bundled the whole approach into an R package titled Gamma.
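The two-phase idea in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation (which uses C++ via Rcpp, and R) and is written in Python only for brevity. It assumes, as an illustration, that Γ accumulates the outer products of z = [1, x, y] over all rows, so that it holds the row count, the linear sums, and the cross-product sums. The function names `gamma_from_scan` and `simple_regression` are hypothetical.

```python
# Sketch only: the layout of Gamma (built from z = [1, x, y]) is an
# assumption made for illustration, not taken from the thesis text.

def gamma_from_scan(rows):
    """Phase 1: one pass over the rows, accumulating Gamma = sum of z * z^T."""
    gamma = None
    for *x, y in rows:
        z = [1.0, *x, y]
        if gamma is None:
            gamma = [[0.0] * len(z) for _ in z]
        for i, zi in enumerate(z):
            for j, zj in enumerate(z):
                gamma[i][j] += zi * zj
    return gamma


def simple_regression(gamma):
    """Phase 2: least-squares fit for one predictor, using Gamma alone."""
    n, sx, sy = gamma[0][0], gamma[0][1], gamma[0][2]
    sxx, sxy = gamma[1][1], gamma[1][2]
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


rows = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]  # points on y = 2x + 1
gamma = gamma_from_scan(rows)
slope, intercept = simple_regression(gamma)
print(slope, intercept)  # 2.0 1.0
```

Under this assumption the model step never revisits the data: once the scan finishes, only the small (d + 2) × (d + 2) matrix needs to stay in memory.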
dc.description.department: Computer Science, Department of
dc.format.digitalOrigin: born digital
dc.format.mimetype: application/pdf
dc.identifier.uri: https://hdl.handle.net/10657/4465
dc.language.iso: eng
dc.rights: The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject: Summarization
dc.subject: Gamma
dc.subject: Machine learning
dc.subject: Linear regression
dc.subject: PCA
dc.subject: Naïve Bayes
dc.subject: K-Means
dc.subject: R machine learning
dc.title: A General Summarization Matrix for Scalable Machine Learning Model Computation in the R Language
dc.type.dcmi: Text
dc.type.genre: Thesis
thesis.degree.college: College of Natural Sciences and Mathematics
thesis.degree.department: Computer Science, Department of
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of Houston
thesis.degree.level: Masters
thesis.degree.name: Master of Science

Files

Original bundle

Name: CHEBOLU-THESIS-2019.pdf
Size: 255.52 KB
Format: Adobe Portable Document Format

License bundle

Name: PROQUEST_LICENSE.txt
Size: 4.44 KB
Format: Plain Text

Name: LICENSE.txt
Size: 1.83 KB
Format: Plain Text