Detection and Analysis of Operational Instability within Distributed Computing Environments

dc.contributor.advisor: Shi, Weidong
dc.contributor.committeeMember: Gabriel, Edgar
dc.contributor.committeeMember: Pinsky, Lawrence S.
dc.contributor.committeeMember: Shah, Shishir Kirit
dc.contributor.committeeMember: Subhlok, Jaspal
dc.creator: Datskova, Olga Vladimirovna 1986-
dc.date.accessioned: 2018-03-09T17:20:31Z
dc.date.available: 2018-03-09T17:20:31Z
dc.date.created: December 2017
dc.date.issued: 2017-12
dc.date.submitted: December 2017
dc.date.updated: 2018-03-09T17:20:31Z
dc.description.abstract: Distributed computing environments are increasingly deployed across geographically distributed data centers using heterogeneous hardware systems. Failures within such environments incur considerable physical and computing time losses that are unacceptable for large-scale scientific processing tasks. At present, resource management systems are limited in detecting and analyzing such occurrences beyond the level of alarms and notifications. The nature of these instabilities is largely unknown, and handling them relies on subsystem expert knowledge and reactive intervention when they occur. This work examines performance fluctuations associated with failures within a large scientific distributed production environment. We first present an approach to distinguish between expected operational behavior and service instability occurring within a data center, examined in the context of network quality, production job efficiency, and job error-state deviation. This method identifies failure domains to allow for online detection of service-state fluctuations. We then propose a data center stability measure along with an event selection approach, used in analyzing past unstable behavior. We determine that, for detected events, states corresponding to an instability are observed before the occurrence of an event. For selected events, we also find a reliability model fit, suggesting potential use in predictive analytics. The developed methods can detect a pre-failure period and identify the service failure domains affected by the instability. This allows user, central, and data center experts to act before the effects of a service failure are felt, with a view of how the failure is expected to develop. This work represents an incremental step towards automated proactive management of large distributed computing environments.
dc.description.department: Computer Science, Department of
dc.format.digitalOrigin: born digital
dc.format.mimetype: application/pdf
dc.identifier.uri: http://hdl.handle.net/10657/2849
dc.language.iso: eng
dc.rights: The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject: Grid computing
dc.subject: Reliability engineering
dc.subject: Performance analytics
dc.title: Detection and Analysis of Operational Instability within Distributed Computing Environments
dc.type.dcmi: Text
dc.type.genre: Thesis
local.embargo.lift: 2019-12-01
local.embargo.terms: 2019-12-01
thesis.degree.college: College of Natural Sciences and Mathematics
thesis.degree.department: Computer Science, Department of
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of Houston
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy

Files

Original bundle

Name: DATSKOVA-DISSERTATION-2017.pdf
Size: 9.96 MB
Format: Adobe Portable Document Format

License bundle

Name: PROQUEST_LICENSE.txt
Size: 4.44 KB
Format: Plain Text
Name: LICENSE.txt
Size: 1.83 KB
Format: Plain Text