Towards comprehensive dependability-driven resource use and message log-analysis for HPC systems diagnosis

Thuan Chuah; Arshad Jhumka; Samantha Alt; Daniel Balouek-Thomert; James Browne; Manish Parashar

doi:10.1016/j.jpdc.2019.05.013

Corresponding author: Thuan Chuah

Research output: Contribution to journal › Article › peer-review

11 Citations (Scopus)

Abstract

Failure analysis plays an important role in the reliability of data centers and high-performance computing (HPC) systems. Recent work has shown that resource use data and failure logs can, separately and together, be used to detect system failure-inducing errors and to diagnose system failures, which result from error propagation and the (unsuccessful) execution of error recovery mechanisms. For more accurate and detailed failure diagnosis, knowledge of error propagation patterns and unsuccessful error recovery is important. To improve system reliability, knowledge of how recovery protocols are deployed is important. This paper describes and demonstrates the application of a new diagnostics framework (CORRMEXT). CORRMEXT analyzes and reports error propagation patterns and the degrees of success and failure of error recovery protocols. The steps in the framework are the correlation of resource use metrics and error messages, and the identification of the earliest times of change in system behaviour. The framework is illustrated with analyses of resource use data and message logs for three HPC systems operated by the Texas Advanced Computing Center (TACC). The illustrations focus on groups of resource use counters and groups of errors; they reveal patterns of: (i) network data and software errors, (ii) Lustre file-system and Linux operating system process errors, and (iii) memory and storage errors. We also confirm that: (i) correlations of resource use and errors can only be identified by applying different correlation algorithms, and (ii) the earliest times of change in system behaviour can only be identified by analyzing both the correlated resource use counters and the correlated errors. We believe CORRMEXT is the first tool that has diagnosed error propagation paths and error recovery attempts on three different HPC systems. CORRMEXT will be released into the public domain in August 2018 to support systems administrators in diagnosing HPC system failures.
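The two steps named in the abstract (correlating resource use counters with error messages, then locating the earliest time of change) can be illustrated with a minimal sketch. This is not the CORRMEXT implementation, which this record does not describe in detail. The choice of Pearson and Spearman as the "different correlation algorithms", the baseline-deviation rule for detecting change, the function names, and the sample values are all assumptions made purely for illustration.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Linear (Pearson) correlation of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Rank (Spearman) correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


def earliest_change(series, baseline_len, k=3.0):
    """Index of the first sample after the baseline window that deviates
    from the baseline mean by more than k baseline standard deviations,
    or None if no sample does. (Assumed rule, not taken from the paper.)"""
    base = series[:baseline_len]
    mu, sigma = mean(base), pstdev(base)
    for t in range(baseline_len, len(series)):
        if abs(series[t] - mu) > k * sigma:
            return t
    return None


# Hypothetical per-interval samples: a network counter and an error count.
net_bytes = [10, 11, 10, 12, 11, 10, 30, 90, 270, 810]
error_count = [0, 0, 0, 0, 0, 0, 0, 1, 2, 3]

# Two correlation measures can disagree in strength on the same pair.
linear_r = pearson(net_bytes, error_count)
rank_r = spearman(net_bytes, error_count)

# Check both series and keep whichever changes first.
changes = [earliest_change(s, baseline_len=6) for s in (net_bytes, error_count)]
first_change = min(c for c in changes if c is not None)
```

In this toy data the counter departs from its baseline one interval before the first error is logged, so inspecting the error series alone would place the change later than inspecting both.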

Original language	English
Pages (from-to)	95-112
Number of pages	18
Journal	Journal of Parallel and Distributed Computing (JPDC)
Volume	132
Early online date	7 Jun 2019
DOIs	https://doi.org/10.1016/j.jpdc.2019.05.013
Publication status	Published - Oct 2019

Bibliographical note

Acknowledgements: We would like to thank the Texas Advanced Computing Center (TACC) for providing the Stampede, Ranger & Lonestar4 cluster log-data, and Tommy Minyard, William Lee Barth and Richard Todd Evans for granting access to their data and HPC systems. We would also like to thank Karl Solchenbach and Marie-Christine Sawley (Intel Corporation, Europe) for granting access to their research scientists, and Theo Damoulas (University of Warwick, UK) for his contribution to validating the CORRMEXT framework. We would also like to thank the anonymous reviewers for their constructive feedback, which helped improve the paper significantly. This research is supported by The Alan Turing Institute under the EPSRC, UK grant EP/N510129/1, The Alan Turing Institute-Intel partnership and The National Science Foundation, USA under OCI awards #0622780, #1203604 and #1134872 to TACC at The University of Texas at Austin.

This paper is dedicated to the memory of Professor Emeritus James Clayton Browne (January 16, 1935–January 19, 2018). Dr. Browne’s contributions to developing the CORRMEXT framework have been crucial but sadly he passed away while we were working on the paper. As such, we wish to keep his name as an author posthumously.

Data Availability Statement

No data availability statement

Keywords

Large HPC systems
Correlation
Variance extraction
Error propagation and recovery
Cluster log-data

Access to Document

10.1016/j.jpdc.2019.05.013
Licence: Unspecified

Cite this

@article{d1e7284fadca465882efaa94c1fa4087,

title = "Towards comprehensive dependability-driven resource use and message log-analysis for HPC systems diagnosis",

abstract = "Failure analysis plays an important role in the reliability of data centers and high-performance computing (HPC) systems. Recent work has shown that resource use data and failure logs can, separately and together, be used to detect system failure-inducing errors and to diagnose system failures, which result from error propagation and the (unsuccessful) execution of error recovery mechanisms. For more accurate and detailed failure diagnosis, knowledge of error propagation patterns and unsuccessful error recovery is important. To improve system reliability, knowledge of how recovery protocols are deployed is important. This paper describes and demonstrates the application of a new diagnostics framework (CORRMEXT). CORRMEXT analyzes and reports error propagation patterns and the degrees of success and failure of error recovery protocols. The steps in the framework are the correlation of resource use metrics and error messages, and the identification of the earliest times of change in system behaviour. The framework is illustrated with analyses of resource use data and message logs for three HPC systems operated by the Texas Advanced Computing Center (TACC). The illustrations focus on groups of resource use counters and groups of errors; they reveal patterns of: (i) network data and software errors, (ii) Lustre file-system and Linux operating system process errors, and (iii) memory and storage errors. We also confirm that: (i) correlations of resource use and errors can only be identified by applying different correlation algorithms, and (ii) the earliest times of change in system behaviour can only be identified by analyzing both the correlated resource use counters and the correlated errors. We believe CORRMEXT is the first tool that has diagnosed error propagation paths and error recovery attempts on three different HPC systems. CORRMEXT will be released into the public domain in August 2018 to support systems administrators in diagnosing HPC system failures.",

keywords = "Large HPC systems, Correlation, Variance extraction, Error propagation and recovery, Cluster log-data",

author = "Thuan Chuah and Arshad Jhumka and Samantha Alt and Daniel Balouek-Thomert and James Browne and Manish Parashar",

note = "Acknowledgements: We would like to thank the Texas Advanced Computing Center (TACC) for providing the Stampede, Ranger & Lonestar4 cluster log-data, and Tommy Minyard, William Lee Barth and Richard Todd Evans for granting access to their data and HPC systems. We would also like to thank Karl Solchenbach and Marie-Christine Sawley (Intel Corporation, Europe) for granting access to their research scientists, and Theo Damoulas (University of Warwick, UK) for his contribution to validating the CORRMEXT framework. We would also like to thank the anonymous reviewers for their constructive feedback, which helped improve the paper significantly. This research is supported by The Alan Turing Institute under the EPSRC, UK grant EP/N510129/1, The Alan Turing Institute-Intel partnership and The National Science Foundation, USA under OCI awards #0622780, #1203604 and #1134872 to TACC at The University of Texas at Austin. This paper is dedicated to the memory of Professor Emeritus James Clayton Browne (January 16, 1935–January 19, 2018). Dr. Browne{\textquoteright}s contributions to developing the CORRMEXT framework have been crucial but sadly he passed away while we were working on the paper. As such, we wish to keep his name as an author posthumously.",

year = "2019",

month = oct,

doi = "10.1016/j.jpdc.2019.05.013",

language = "English",

volume = "132",

pages = "95--112",

journal = "Journal of Parallel and Distributed Computing (JPDC)",

issn = "0743-7315",

publisher = "Academic Press Inc.",

}

TY - JOUR

T1 - Towards comprehensive dependability-driven resource use and message log-analysis for HPC systems diagnosis

AU - Chuah, Thuan

AU - Jhumka, Arshad

AU - Alt, Samantha

AU - Balouek-Thomert, Daniel

AU - Browne, James

AU - Parashar, Manish

N1 - Acknowledgements: We would like to thank the Texas Advanced Computing Center (TACC) for providing the Stampede, Ranger & Lonestar4 cluster log-data, and Tommy Minyard, William Lee Barth and Richard Todd Evans for granting access to their data and HPC systems. We would also like to thank Karl Solchenbach and Marie-Christine Sawley (Intel Corporation, Europe) for granting access to their research scientists, and Theo Damoulas (University of Warwick, UK) for his contribution to validating the CORRMEXT framework. We would also like to thank the anonymous reviewers for their constructive feedback, which helped improve the paper significantly. This research is supported by The Alan Turing Institute under the EPSRC, UK grant EP/N510129/1, The Alan Turing Institute-Intel partnership and The National Science Foundation, USA under OCI awards #0622780, #1203604 and #1134872 to TACC at The University of Texas at Austin. This paper is dedicated to the memory of Professor Emeritus James Clayton Browne (January 16, 1935–January 19, 2018). Dr. Browne’s contributions to developing the CORRMEXT framework have been crucial but sadly he passed away while we were working on the paper. As such, we wish to keep his name as an author posthumously.

PY - 2019/10

Y1 - 2019/10

N2 - Failure analysis plays an important role in the reliability of data centers and high-performance computing (HPC) systems. Recent work has shown that resource use data and failure logs can, separately and together, be used to detect system failure-inducing errors and to diagnose system failures, which result from error propagation and the (unsuccessful) execution of error recovery mechanisms. For more accurate and detailed failure diagnosis, knowledge of error propagation patterns and unsuccessful error recovery is important. To improve system reliability, knowledge of how recovery protocols are deployed is important. This paper describes and demonstrates the application of a new diagnostics framework (CORRMEXT). CORRMEXT analyzes and reports error propagation patterns and the degrees of success and failure of error recovery protocols. The steps in the framework are the correlation of resource use metrics and error messages, and the identification of the earliest times of change in system behaviour. The framework is illustrated with analyses of resource use data and message logs for three HPC systems operated by the Texas Advanced Computing Center (TACC). The illustrations focus on groups of resource use counters and groups of errors; they reveal patterns of: (i) network data and software errors, (ii) Lustre file-system and Linux operating system process errors, and (iii) memory and storage errors. We also confirm that: (i) correlations of resource use and errors can only be identified by applying different correlation algorithms, and (ii) the earliest times of change in system behaviour can only be identified by analyzing both the correlated resource use counters and the correlated errors. We believe CORRMEXT is the first tool that has diagnosed error propagation paths and error recovery attempts on three different HPC systems. CORRMEXT will be released into the public domain in August 2018 to support systems administrators in diagnosing HPC system failures.

KW - Large HPC systems

KW - Correlation

KW - Variance extraction

KW - Error propagation and recovery

KW - Cluster log-data

UR - http://dx.doi.org/10.1016/j.jpdc.2019.05.013

U2 - 10.1016/j.jpdc.2019.05.013

DO - 10.1016/j.jpdc.2019.05.013

M3 - Article

SN - 0743-7315

VL - 132

SP - 95

EP - 112

JO - Journal of Parallel and Distributed Computing (JPDC)

JF - Journal of Parallel and Distributed Computing (JPDC)

ER -
