Diagnosing the root-causes of failures from cluster log files

Thuan Chuah, Shyh-hao Kuo, Paul Hiew, William-Chandra Tjhi, Gary Lee, John Hammond, Marek Michalewicz, Terence Hung, James Browne

Research output: Chapter in Book/Report/Conference proceeding › Published conference contribution

39 Citations (Scopus)

Abstract

System event logs are often the primary source of information for diagnosing (and predicting) the causes of failures in cluster systems. Because the system hardware and software components interact, the event logs of large cluster systems consist of streams of interleaved events, and only a small fraction of the events over a small time span are relevant to the diagnosis of a given failure. Furthermore, the process of troubleshooting the causes of failures is largely manual and ad hoc. In this paper, we present a systematic methodology for reconstructing event order and establishing correlations among events that indicate the root-causes of a given failure from very large syslogs. We developed a diagnostics tool, FDiag, which extracts log entries as structured message templates and uses statistical correlation analysis to establish probable cause-and-effect relationships for the fault being analyzed. We applied FDiag to analyze failures due to breakdowns in interactions between the Lustre file system and its clients on the Ranger supercomputer at the Texas Advanced Computing Center (TACC). The results are positive: FDiag is able to identify the dates and the time periods that contain the significant events which eventually led to the occurrence of compute node soft lockups.
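
As a rough illustration of the two steps named in the abstract (reducing log entries to message templates, then correlating template activity with a failure event), the following Python sketch may help. It is not FDiag's implementation. The masking rules, the fixed time windows, the choice of Pearson correlation, and every function name are assumptions made for this example only.

```python
import re
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical masking rules (not taken from the paper): variable fields
# are replaced by placeholders so that similar messages share a template.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(message):
    """Reduce a raw log message to a template by masking variable fields."""
    for pattern, placeholder in _MASKS:
        message = pattern.sub(placeholder, message)
    return message


def window_counts(events, window_seconds):
    """Count template occurrences per fixed-size time window.

    events: iterable of (timestamp_in_seconds, message) pairs.
    Returns {template: Counter({window_index: count})}.
    """
    counts = defaultdict(Counter)
    for timestamp, message in events:
        counts[to_template(message)][int(timestamp // window_seconds)] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlate_with(counts, target_template):
    """Correlate every template's per-window counts with the target's counts."""
    indices = [w for counter in counts.values() for w in counter]
    if not indices:
        return {}
    windows = range(min(indices), max(indices) + 1)  # include empty windows
    target = [counts[target_template][w] for w in windows]
    return {
        template: pearson([counter[w] for w in windows], target)
        for template, counter in counts.items()
        if template != target_template
    }
```

Templates whose windowed counts track the failure template closely would be candidates for closer inspection. Correlation in a shared window does not by itself establish cause and effect, which is why the abstract speaks of probable relationships.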
Original language: English
Title of host publication: 2010 IEEE International Conference on High Performance Computing (HiPC)
Publisher: IEEE Xplore
Pages: 1-10
Number of pages: 10
Publication status: Published - 19 Dec 2010

Bibliographical note

The authors would like to thank Tommy Minyard from the Texas Advanced Computing Center (TACC) for providing the Ranger system logs, Chris Jordan from TACC for his suggestion of looking at evictions as a case study, Stephen Wong from ACRC for access to his system administrators and for the Turing system logs, and Ivor Tsang from Nanyang Technological University (Singapore) for his input. This research was supported by the National Science Foundation under a grant from the Office of CyberInfrastructure.
