Abstract
Failure diagnosis for large compute clusters using
only message logs is known to be incomplete. Recent availability
of resource use data provides another potentially useful source of
data for failure detection and diagnosis. Early work combining
message logs and resource use data for failure diagnosis has
shown promising results. This paper describes the CRUMEL
framework, which implements a new approach to combining
rationalized message logs and resource use data for failure diagnosis. CRUMEL identifies patterns of errors and resource use and
correlates these patterns by time with system failures. Application
of CRUMEL to data from the Ranger supercomputer has yielded
improved diagnoses over previous research. CRUMEL has: (i)
shown that additional events correlated with system failures can
be identified only by applying different correlation algorithms;
(ii) confirmed six groups of errors; (iii) identified Lustre I/O
resource use counters that are correlated with the occurrence of
Lustre faults and are potential flags for online detection of
failures; (iv) matched the dates of correlated error events and
correlated resource use with the dates of compute node hang-ups; and (v) identified two more error groups associated with
compute node hang-ups. The pre-processed data will be placed in
the public domain in September 2016.
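The core idea of correlating error-event patterns by time with system failures can be illustrated with a minimal sketch. The function, data, and one-hour window below are hypothetical illustrations, not CRUMEL's actual algorithms or the Ranger data:

```python
from datetime import datetime, timedelta

def correlate_by_time(event_times, failure_times, window=timedelta(hours=1)):
    """Flag error events that occur within `window` before any failure.

    Illustrative time-window correlation only; CRUMEL applies several
    different correlation algorithms, which are not reproduced here.
    """
    correlated = []
    for ev in event_times:
        # An event correlates if some failure follows it within the window.
        if any(timedelta(0) <= f - ev <= window for f in failure_times):
            correlated.append(ev)
    return correlated

# Hypothetical error events and one compute-node hang-up.
events = [datetime(2016, 3, 1, 9, 30), datetime(2016, 3, 1, 14, 0)]
failures = [datetime(2016, 3, 1, 10, 0)]
print(correlate_by_time(events, failures))  # only the 09:30 event correlates
```

In practice a framework like CRUMEL would apply such correlation to grouped error patterns and resource use counters rather than raw timestamps, and would compare results across multiple correlation algorithms.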
| Original language | English |
|---|---|
| Title of host publication | 2016 IEEE 23rd International Conference on High Performance Computing (HiPC) |
| Publisher | IEEE Xplore |
| Pages | 232-241 |
| Number of pages | 10 |
| DOIs | |
| Publication status | Published - Dec 2016 |