Abstract
Failure diagnosis for large compute clusters using
only message logs is known to be incomplete. Recent availability
of resource use data provides another potentially useful source of
data for failure detection and diagnosis. Early work combining
message logs and resource use data for failure diagnosis has
shown promising results. This paper describes the CRUMEL
framework, which implements a new approach to combining
rationalized message logs and resource use data for failure diagnosis. CRUMEL identifies patterns of errors and resource use and
correlates these patterns by time with system failures. Application
of CRUMEL to data from the Ranger supercomputer has yielded
improved diagnoses over previous research. CRUMEL has: (i)
shown that additional events correlated with system failures can
be identified only by applying different correlation algorithms;
(ii) confirmed six groups of errors; (iii) identified Lustre I/O
resource use counters that are correlated with the occurrence of
Lustre faults and are potential flags for online detection of
failures; (iv) matched the dates of correlated error events and
correlated resource use with the dates of compute node hang-ups; and (v) identified two more error groups associated with
compute node hang-ups. The pre-processed data will be placed in
the public domain in September 2016.
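The core idea of correlating error-event patterns by time with system failures can be illustrated with a minimal sketch. The function, data, and one-hour window below are hypothetical illustrations, not CRUMEL's actual algorithms or the Ranger data:

```python
from datetime import datetime, timedelta

def correlate_by_time(event_times, failure_times, window=timedelta(hours=1)):
    """Flag error events that occur within `window` before any failure.

    Illustrative time-window correlation only; CRUMEL applies several
    different correlation algorithms, which are not reproduced here.
    """
    correlated = []
    for ev in event_times:
        # An event correlates if some failure follows it within the window.
        if any(timedelta(0) <= f - ev <= window for f in failure_times):
            correlated.append(ev)
    return correlated

# Hypothetical error events and one compute-node hang-up.
events = [datetime(2016, 3, 1, 9, 30), datetime(2016, 3, 1, 14, 0)]
failures = [datetime(2016, 3, 1, 10, 0)]
print(correlate_by_time(events, failures))  # only the 09:30 event correlates
```

In practice a framework like CRUMEL would apply such correlation to grouped error patterns and resource use counters rather than raw timestamps, and would compare results across multiple correlation algorithms.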
| Original language | English |
|---|---|
| Title of host publication | 2016 IEEE 23rd International Conference on High Performance Computing (HiPC) |
| Publisher | IEEE Xplore |
| Pages | 232-241 |
| Number of pages | 10 |
| DOIs | |
| Publication status | Published - Dec 2016 |