Ordonez, Carlos
2019-09-13
2019-09-13
May 2019
2019-05
May 2019
https://hdl.handle.net/10657/4465

Data analysis is an essential task for research. Modern datasets are often large enough that analyzing them may require a parallel DBMS, the Hadoop stack, or a parallel cluster. We propose an alternative to these methods: using a lightweight language/system like R to compute Machine Learning models on such datasets. This approach eliminates the need for cluster or parallel systems in most cases, and so makes such analysis accessible to an average user. Specifically, we aim to eliminate the physical memory and running-time limitations currently present in R packages when working on a single machine. R is a powerful language and is very popular for data analysis. However, R can be significantly slow and does not allow flexible modifications, and the process of making it faster and more efficient is cumbersome. To address these drawbacks, we implemented our approach in two phases. The first phase constructs a summarization matrix, Γ, from a one-time scan of the source dataset; it is implemented in C++ using the Rcpp package. There are two forms of the Γ matrix, Diagonal and Non-Diagonal Gamma, each of which is efficient for computing specific models. The second phase uses the constructed Γ matrix to compute Machine Learning models such as PCA, Linear Regression, Naïve Bayes, K-means, and similar models; it is implemented in R. We bundled the whole approach into an R package, titled Gamma.

application/pdf
eng

The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).

Summarization; Gamma; Machine learning; Linear regression; PCA; Naïve Bayes; K-Means; R machine learning

A General Summarization Matrix for Scalable Machine Learning Model Computation in the R Language
2019-09-13
Thesis
born digital
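
The two-phase idea in the abstract (summarize the dataset in one scan, then compute models from the summary alone) can be illustrated with a small sketch. The abstract does not define Γ, so the sketch assumes a formulation in which each row contributes z = [1, x, y] and Γ is the sum of the outer products z zᵀ; this corresponds most closely to the Non-Diagonal form. The function names are hypothetical and are not the API of the Gamma package. It is written in plain Python for brevity, not in the C++/R of the thesis.

```python
def gamma_matrix(rows):
    """Phase 1 (assumed form): one scan, accumulating the sum of z z^T with z = [1, x..., y]."""
    gamma = None
    for x, y in rows:
        z = [1.0, *x, y]
        if gamma is None:
            gamma = [[0.0] * len(z) for _ in z]
        for a, za in enumerate(z):
            for b, zb in enumerate(z):
                gamma[a][b] += za * zb
    return gamma


def solve(a, b):
    """Solve a * sol = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (m[r][n] - s) / m[r][r]
    return sol


def linear_regression_from_gamma(gamma):
    """Phase 2: least-squares coefficients [intercept, slopes...] using only Gamma."""
    k = len(gamma) - 1                  # index of the y row/column
    a = [row[:k] for row in gamma[:k]]  # block holding [1, x]^T [1, x]
    b = [row[k] for row in gamma[:k]]   # block holding [1, x]^T y
    return solve(a, b)


# Toy data lying exactly on y = 2 + 3x; the raw rows are not needed after the scan.
data = [([float(i)], 2.0 + 3.0 * i) for i in range(5)]
gamma = gamma_matrix(data)
beta = linear_regression_from_gamma(gamma)
```

Under this assumption Γ has size (d+2) × (d+2) regardless of the number of rows, which is why the model-fitting step can run in memory even when the source dataset cannot.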