A General Summarization Matrix for Scalable Machine Learning Model Computation in the R Language






Data analysis is an essential task in research. Modern datasets are often large enough that analyzing them seems to require a parallel DBMS, the Hadoop stack, or a compute cluster. We propose an alternative: using a lightweight language/system such as R to compute machine learning models on such datasets directly. This approach eliminates the need for cluster or parallel systems in most cases, and thus makes such analysis accessible to the average user. Specifically, we aim to remove the physical memory and speed limitations currently present in R packages when working on a single machine. R is a powerful language and is very popular for data analysis; however, it can be significantly slow, it does not allow flexible modifications, and making R code faster and more efficient is cumbersome. To address these drawbacks, we implemented our approach in two phases. The first phase constructs a summarization matrix, Γ, in a one-time scan of the source dataset, and is implemented in C++ using the Rcpp package. The Γ matrix comes in two forms, Diagonal and Non-Diagonal Gamma, each of which is efficient for computing specific models. The second phase, implemented in R, uses the constructed Γ matrix to compute machine learning models such as PCA, Linear Regression, Naïve Bayes, and K-means. We bundled our whole approach into an R package, titled Gamma.



Summarization, Gamma, Machine learning, Linear regression, PCA, Naïve Bayes, K-means, R machine learning