AN EVALUATION OF THE SPARK PROGRAMMING MODEL FOR BIG DATA ANALYTICS

Ayyalasomayajula, Haripriya 1992-

dc.contributor.advisor	Gabriel, Edgar
dc.contributor.committeeMember	Shi, Weidong
dc.contributor.committeeMember	Price, Daniel M.
dc.creator	Ayyalasomayajula, Haripriya 1992-
dc.date.accessioned	2015-08-28T03:36:49Z
dc.date.available	2015-08-28T03:36:49Z
dc.date.created	May 2015
dc.date.issued	2015-05
dc.date.updated	2015-08-28T03:36:50Z
dc.description.abstract	Companies such as Google and Amazon seek a competitive business advantage from the insights gained by processing petabytes of data. Big Data refers to data characterized by large volume, great variety, and the ubiquitous nature of its sources. MapReduce is a programming model that provides a highly scalable and efficient way to analyze massive datasets on large-scale commodity clusters. Although Hadoop, its open-source implementation, became the de facto standard for parallel processing of batch workloads, it is inefficient for iterative, incremental algorithms, ad hoc queries, and stream processing. Apache Spark is a general-purpose cluster-computing framework that supports in-memory data analytics. It preserves the merits of Hadoop and overcomes its limitations. This thesis evaluates the performance of the Spark programming model for Big Data analytics. Code was developed to analyze historic air-quality data using Spark and MapReduce, which involved significant development effort and tuning of configuration parameters. Spark was observed to perform up to 20% better than MapReduce. Applying machine-learning techniques to Big Data forms the core of data analytics. MLlib is a scalable machine-learning library offered by the Spark ecosystem. To extend our analyses, we perform clustering on the air-quality dataset and evaluate the performance, clustering quality, and usability of the K-Means clustering implementation provided by Spark's MLlib library against that of Apache Mahout. We also attempted to develop code to evaluate Spark's ability to integrate with HBase as a data source. Although the initial test cases ran successfully with a small dataset, the documentation currently available was insufficient, so this is reserved for future work.
dc.description.department	Computer Science, Department of
dc.format.digitalOrigin	born digital
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/10657/1130
dc.language.iso	eng
dc.rights	The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject	Big data analytics
dc.subject	Hadoop
dc.subject	MapReduce
dc.subject	Apache Spark
dc.subject	MLlib
dc.subject	HBase
dc.subject.lcsh	Computer science
dc.title	AN EVALUATION OF THE SPARK PROGRAMMING MODEL FOR BIG DATA ANALYTICS
dc.type.dcmi	Text
dc.type.genre	Thesis
thesis.degree.college	College of Natural Sciences and Mathematics
thesis.degree.department	Computer Science, Department of
thesis.degree.discipline	Computer Science
thesis.degree.grantor	University of Houston
thesis.degree.level	Masters
thesis.degree.name	Master of Science

Files

Original bundle

Name: AYYALASOMAYAJULA-THESIS-2015.pdf
Size: 472.21 KB
Format: Adobe Portable Document Format

Collections

Published ETD Collection