TY - GEN
T1 - Insights into the Diagnosis of System Failures from Cluster Message Logs
AU - Chuah, Thuan
AU - Jhumka, Arshad
AU - Browne, James
AU - Barth, Bill
AU - Narasimhamurthy, Sai
PY - 2015/9
Y1 - 2015/9
N2 - Large cluster systems are composed of complex, interacting hardware and software components. Components, or the interactions between components, may fail for many different reasons, leading to the eventual failure of executing jobs. This paper investigates an open question about failure diagnosis: What are the characteristics of the errors that lead to cluster system failures? To this end, this paper gives a systematic process for identifying and characterizing the root causes of failures. We applied an extended version of the FDiagV3 diagnostics toolkit to the log files of the Ranger and Lonestar supercomputers. Our results show that: (i) failures were a result of recurrent issues and errors, (ii) a small set of nodes is associated with these issues and errors, and (iii) Ranger and Lonestar display similar sets of problems. FDiagV3 will be put in the public domain in May 2015 to support failure diagnosis for large cluster systems.
AB - Large cluster systems are composed of complex, interacting hardware and software components. Components, or the interactions between components, may fail for many different reasons, leading to the eventual failure of executing jobs. This paper investigates an open question about failure diagnosis: What are the characteristics of the errors that lead to cluster system failures? To this end, this paper gives a systematic process for identifying and characterizing the root causes of failures. We applied an extended version of the FDiagV3 diagnostics toolkit to the log files of the Ranger and Lonestar supercomputers. Our results show that: (i) failures were a result of recurrent issues and errors, (ii) a small set of nodes is associated with these issues and errors, and (iii) Ranger and Lonestar display similar sets of problems. FDiagV3 will be put in the public domain in May 2015 to support failure diagnosis for large cluster systems.
UR - http://dx.doi.org/10.1109/edcc.2015.19
U2 - 10.1109/edcc.2015.19
DO - 10.1109/edcc.2015.19
M3 - Published conference contribution
BT - 2015 11th European Dependable Computing Conference (EDCC)
PB - IEEE Xplore
ER -