Linking Resource Usage Anomalies with System Failures from Cluster Log Data

Edward Chuah; Arshad Jhumka; Sai Narasimhamurthy; John Hammond; James Browne; Bill Barth

doi:10.1109/srds.2013.20

Computing Science

Research output: Chapter in Book/Report/Conference proceeding › Published conference contribution

36 Citations (Scopus)

Abstract

Bursts of abnormally high use of resources are thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated the role of high resource usage in system failures, largely due to the lack of a comprehensive resource monitoring tool which resolves resource use by job and node. The recently developed TACC_Stats resource use monitor provides the required resource use data. This paper presents the ANCOR diagnostics system, which applies TACC_Stats data to identify resource use anomalies and applies log analysis to link those anomalies with system failures. Applying ANCOR to first identify multiple sources of resource anomalies on the Ranger supercomputer, then correlate them with failures recorded in the message logs, and finally diagnose the cause of the failures has identified four new causes of compute node soft lockups. ANCOR can be adapted to any system that uses a resource use monitor which resolves resource use by job.

Original language	English
Title of host publication	2013 IEEE 32nd International Symposium on Reliable Distributed Systems (SRDS)
Publisher	IEEE
Pages	111-120
Number of pages	10
DOIs	https://doi.org/10.1109/srds.2013.20
Publication status	Published - Sept 2013

Bibliographical note

Acknowledgements: We thank the Texas Advanced Computing Center (TACC) for providing the Ranger message logs and resource use data, and Malcolm Muggeridge (Xyratex) for granting access to his researchers. This research was supported in part by the National Science Foundation under OCI award #0622780 and #1203604 to TACC at the University of Texas at Austin.

Access to Document

10.1109/srds.2013.20

Cite this

@inproceedings{754b9eeb32a548e38903f7244265e483,

title = "Linking Resource Usage Anomalies with System Failures from Cluster Log Data",

abstract = "Bursts of abnormally high use of resources are thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated the role of high resource usage on system failures, largely due to the lack of a comprehensive resource monitoring tool which resolves resource use by job and node. The recently developed TACC_Stats resource use monitor provides the required resource use data. This paper presents the ANCOR diagnostics system that applies TACC_Stats data to identify resource use anomalies and applies log analysis to link resource use anomalies with system failures. Application of ANCOR to first identify multiple sources of resource anomalies on the Ranger supercomputer, then correlate them with failures recorded in the message logs and diagnosing the cause of the failures, has identified four new causes of compute node soft lockups. ANCOR can be adapted to any system that uses a resource use monitor which resolves resource use by job.",

author = "Edward Chuah and Arshad Jhumka and Sai Narasimhamurthy and John Hammond and James Browne and Bill Barth",

note = "Acknowledgements: We thank the Texas Advanced Computing Center (TACC) for providing the Ranger message logs and resource use data, and Malcolm Muggeridge (Xyratex) for granting access to his researchers. This research was supported in part by the National Science Foundation under OCI award \#0622780 and \#1203604 to TACC at the University of Texas at Austin.",

year = "2013",

month = sep,

doi = "10.1109/srds.2013.20",

language = "English",

pages = "111--120",

booktitle = "2013 IEEE 32nd International Symposium on Reliable Distributed Systems (SRDS)",

publisher = "IEEE",

}

TY - GEN

T1 - Linking Resource Usage Anomalies with System Failures from Cluster Log Data

AU - Chuah, Edward

AU - Jhumka, Arshad

AU - Narasimhamurthy, Sai

AU - Hammond, John

AU - Browne, James

AU - Barth, Bill

N1 - Acknowledgements: We thank the Texas Advanced Computing Center (TACC) for providing the Ranger message logs and resource use data, and Malcolm Muggeridge (Xyratex) for granting access to his researchers. This research was supported in part by the National Science Foundation under OCI award #0622780 and #1203604 to TACC at the University of Texas at Austin

PY - 2013/9

Y1 - 2013/9

N2 - Bursts of abnormally high use of resources are thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated the role of high resource usage on system failures, largely due to the lack of a comprehensive resource monitoring tool which resolves resource use by job and node. The recently developed TACC_Stats resource use monitor provides the required resource use data. This paper presents the ANCOR diagnostics system that applies TACC_Stats data to identify resource use anomalies and applies log analysis to link resource use anomalies with system failures. Application of ANCOR to first identify multiple sources of resource anomalies on the Ranger supercomputer, then correlate them with failures recorded in the message logs and diagnosing the cause of the failures, has identified four new causes of compute node soft lockups. ANCOR can be adapted to any system that uses a resource use monitor which resolves resource use by job.

AB - Bursts of abnormally high use of resources are thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated the role of high resource usage on system failures, largely due to the lack of a comprehensive resource monitoring tool which resolves resource use by job and node. The recently developed TACC_Stats resource use monitor provides the required resource use data. This paper presents the ANCOR diagnostics system that applies TACC_Stats data to identify resource use anomalies and applies log analysis to link resource use anomalies with system failures. Application of ANCOR to first identify multiple sources of resource anomalies on the Ranger supercomputer, then correlate them with failures recorded in the message logs and diagnosing the cause of the failures, has identified four new causes of compute node soft lockups. ANCOR can be adapted to any system that uses a resource use monitor which resolves resource use by job.

UR - http://dx.doi.org/10.1109/srds.2013.20

U2 - 10.1109/srds.2013.20

DO - 10.1109/srds.2013.20

M3 - Published conference contribution

SP - 111

EP - 120

BT - 2013 IEEE 32nd International Symposium on Reliable Distributed Systems (SRDS)

PB - IEEE

ER -
