AN EVALUATION OF THE SPARK PROGRAMMING MODEL FOR BIG DATA ANALYTICS
Ayyalasomayajula, Haripriya 1992-
Companies such as Google and Amazon seek competitive business advantage from the insights drawn by processing petabytes of data. Big Data refers to data characterized by large volume, great variety, and the ubiquitous nature of its sources. MapReduce is a programming model that provides a highly scalable and efficient solution for analyzing massive datasets on large-scale commodity clusters. Though Hadoop, its open-source implementation, became the de facto standard for parallel processing of batch workloads, it is inefficient for iterative and incremental algorithms, ad hoc queries, and stream processing. Apache Spark is a general-purpose cluster-computing framework that supports in-memory data analytics. It preserves the merits offered by Hadoop and overcomes its limitations. This thesis evaluates the performance offered by the Spark programming model for Big Data analytics. Code was developed to analyze historic air-quality data using both Spark and MapReduce, which involved significant development effort and tuning of configuration parameters. We observe that Spark delivers up to 20% better performance than MapReduce. Applying machine-learning techniques to Big Data forms the core of data analytics. MLlib is a scalable machine-learning library offered by the Spark ecosystem. To extend our analyses, we perform clustering on the air-quality dataset and evaluate the performance, clustering quality, and usability of the K-Means clustering implementation provided by Spark's MLlib library against that of Apache Mahout. We also attempted to develop code to evaluate Spark's ability to integrate with HBase as a data source. Although the initial test cases ran successfully on a small dataset, the documentation currently available is insufficient, so this evaluation is reserved for future work.