Detection and Analysis of Operational Instability within Distributed Computing Environments






Distributed computing environments are increasingly deployed across geographically distributed data centers using heterogeneous hardware systems. Failures within such environments incur considerable losses of physical and computing time that are unacceptable for large-scale scientific processing tasks. At present, resource management systems are limited in detecting and analyzing such occurrences beyond the level of alarms and notifications. The nature of these instabilities remains largely unknown, and handling them relies on the knowledge of subsystem experts and on reactive intervention when they occur. This work examines performance fluctuations associated with failures within a large scientific distributed production environment. We first present an approach to distinguish between expected operational behavior and service instability occurring within a data center, examined in the context of network quality, production job efficiency, and job error-state deviation. This method identifies failure domains to allow for online detection of service-state fluctuations. We then propose a data center stability measure along with an event selection approach, used in analyzing past unstable behavior. We determine that, for detected events, states corresponding to an instability are observed before the event occurs. For selected events, we also find that a reliability model fits the observed behavior, suggesting potential use in predictive analytics. The developed methods can detect a pre-failure period and identify the service failure domains affected by the instability. This allows user, central, and data center experts to act before the effects of a service failure are felt, with a view of how the failure is expected to develop. This work represents an incremental step towards automated proactive management of large distributed computing environments.



Grid computing, Reliability engineering, Performance analytics