Detection and Analysis of Operational Instability within Distributed Computing Environments

Datskova, Olga Vladimirovna 1986-

dc.contributor.advisor	Shi, Weidong
dc.contributor.committeeMember	Gabriel, Edgar
dc.contributor.committeeMember	Pinsky, Lawrence S.
dc.contributor.committeeMember	Shah, Shishir Kirit
dc.contributor.committeeMember	Subhlok, Jaspal
dc.creator	Datskova, Olga Vladimirovna 1986-
dc.date.accessioned	2018-03-09T17:20:31Z
dc.date.available	2018-03-09T17:20:31Z
dc.date.created	December 2017
dc.date.issued	2017-12
dc.date.submitted	December 2017
dc.date.updated	2018-03-09T17:20:31Z
dc.description.abstract	Distributed computing environments are increasingly deployed across geographically distributed data centers using heterogeneous hardware systems. Failures within such environments incur considerable physical and computing-time losses that are unacceptable for large-scale scientific processing tasks. At present, resource management systems are limited in detecting and analyzing such occurrences beyond the level of alarms and notifications. The nature of these instabilities is largely unknown, and handling them relies on subsystem expert knowledge and reactive intervention when they occur. This work examines performance fluctuations associated with failures within a large scientific distributed production environment. We first present an approach to distinguish between expected operational behavior and service instability occurring within a data center, examined in the context of network quality, production job efficiency, and job error-state deviation. This method identifies failure domains to allow for online detection of service-state fluctuations. We then propose a data center stability measure along with an event selection approach, used in analyzing past unstable behavior. We determine that, for detected events, states corresponding to an instability are observed before the event occurs. For selected events, we also find a reliability model fit, suggesting potential use in predictive analytics. The developed methods can detect a pre-failure period and identify the service failure domains affected by the instability. This allows user, central, and data center experts to act before the effects of a service failure are felt, with a view of how the failure is expected to develop. This work represents an incremental step towards automated proactive management of large distributed computing environments.
dc.description.department	Computer Science, Department of
dc.format.digitalOrigin	born digital
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/10657/2849
dc.language.iso	eng
dc.rights	The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject	Grid computing
dc.subject	Reliability engineering
dc.subject	Performance analytics
dc.title	Detection and Analysis of Operational Instability within Distributed Computing Environments
dc.type.dcmi	Text
dc.type.genre	Thesis
local.embargo.lift	2019-12-01
local.embargo.terms	2019-12-01
thesis.degree.college	College of Natural Sciences and Mathematics
thesis.degree.department	Computer Science, Department of
thesis.degree.discipline	Computer Science
thesis.degree.grantor	University of Houston
thesis.degree.level	Doctoral
thesis.degree.name	Doctor of Philosophy

Files

Original bundle

Name: DATSKOVA-DISSERTATION-2017.pdf
Size: 9.96 MB
Format: Adobe Portable Document Format

License bundle

Name: PROQUEST_LICENSE.txt
Size: 4.44 KB
Format: Plain Text

Name: LICENSE.txt
Size: 1.83 KB
Format: Plain Text

Collections

Published ETD Collection