Abstract
Bursts of abnormally high resource use are thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated the role of high resource use in system failures, largely because no comprehensive resource monitoring tool has been available that resolves resource use by job and node. The recently developed TACC_Stats resource use monitor provides the required data. This paper presents the ANCOR diagnostics system, which applies TACC_Stats data to identify resource use anomalies and applies log analysis to link those anomalies with system failures. Applying ANCOR to identify multiple sources of resource anomalies on the Ranger supercomputer, correlate them with failures recorded in the message logs, and diagnose the causes of those failures has identified four new causes of compute node soft lockups. ANCOR can be adapted to any system that uses a resource use monitor that resolves resource use by job.
Original language | English |
---|---|
Title of host publication | 2013 IEEE 32nd International Symposium on Reliable Distributed Systems (SRDS) |
Publisher | IEEE Xplore |
Pages | 111-120 |
Number of pages | 10 |
DOIs | |
Publication status | Published - Sept 2013 |
Bibliographical note
Acknowledgements: We thank the Texas Advanced Computing Center (TACC) for providing the Ranger message logs and resource use data, and Malcolm Muggeridge (Xyratex) for granting access to his researchers. This research was supported in part by the National Science Foundation under OCI awards #0622780 and #1203604 to TACC at the University of Texas at Austin.