AN EVALUATION OF THE SPARK PROGRAMMING MODEL FOR BIG DATA ANALYTICS

dc.contributor.advisor: Gabriel, Edgar
dc.contributor.committeeMember: Shi, Weidong
dc.contributor.committeeMember: Price, Daniel M.
dc.creator: Ayyalasomayajula, Haripriya 1992-
dc.date.accessioned: 2015-08-28T03:36:49Z
dc.date.available: 2015-08-28T03:36:49Z
dc.date.created: May 2015
dc.date.issued: 2015-05
dc.date.updated: 2015-08-28T03:36:50Z
dc.description.abstract: Companies such as Google and Amazon seek a competitive business advantage from the insights drawn by processing petabytes of data. Big Data refers to data characterized by large volume, great variety, and the ubiquity of its sources. MapReduce is a programming model that provides a highly scalable and efficient way to analyze massive datasets on large-scale commodity clusters. Although Hadoop, its open-source implementation, became the de facto standard for parallel processing of batch workloads, it is inefficient for iterative and incremental algorithms, ad hoc queries, and stream processing. Apache Spark is a general-purpose cluster-computing framework that supports in-memory data analytics. It preserves the merits of Hadoop while overcoming these limitations. This thesis evaluates the performance of the Spark programming model for Big Data analytics. Code was developed to analyze historical air-quality data using both Spark and MapReduce, which involved significant development effort and tuning of configuration parameters. Spark was observed to perform up to 20% better than MapReduce. Applying machine-learning techniques to Big Data forms the core of data analytics. MLlib is a scalable machine-learning library offered by the Spark ecosystem. To extend our analyses, we perform clustering on the air-quality dataset and evaluate the performance, clustering quality, and usability of the K-Means clustering implementation provided by Spark's MLlib library against that of Apache Mahout. We also attempted to develop code to evaluate Spark's ability to use HBase as a data source. Although the initial test cases ran successfully on a small dataset, the documentation currently available is insufficient, so this evaluation is reserved for future work.
dc.description.department: Computer Science, Department of
dc.format.digitalOrigin: born digital
dc.format.mimetype: application/pdf
dc.identifier.uri: http://hdl.handle.net/10657/1130
dc.language.iso: eng
dc.rights: The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject: Big data analytics
dc.subject: Hadoop
dc.subject: MapReduce
dc.subject: Apache Spark
dc.subject: MLlib
dc.subject: HBase
dc.subject.lcsh: Computer science
dc.title: AN EVALUATION OF THE SPARK PROGRAMMING MODEL FOR BIG DATA ANALYTICS
dc.type.dcmi: Text
dc.type.genre: Thesis
thesis.degree.college: College of Natural Sciences and Mathematics
thesis.degree.department: Computer Science, Department of
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of Houston
thesis.degree.level: Masters
thesis.degree.name: Master of Science

Files

Original bundle

Name: AYYALASOMAYAJULA-THESIS-2015.pdf
Size: 472.21 KB
Format: Adobe Portable Document Format

License bundle

Name: LICENSE.txt
Size: 1.83 KB
Format: Plain Text