Using Resource Use Data and System Logs for HPC System Error Propagation and Recovery Diagnosis

Thuan Chuah; Arshad Jhumka; Samantha Alt; JJ Villalobos; Josh Fryman; Bill Barth; Manish Parashar

doi:10.1109/ispa-bdcloud-sustaincom-socialcom48970.2019.00072

Research output: Chapter in Book/Report/Conference proceeding › Published conference contribution

4 Citations (Scopus)

Abstract

Analyzing failures is important for the reliability of HPC systems, and failure diagnosis based only on system logs is incomplete. Resource use data, made available recently, is another potential source of data for failure analysis. Recent work that combines analysis of system logs with resource use data shows promising results. In this paper, we describe a new workflow for combining system resource usage and failure logs for diagnosis. The workflow, called EXERMEST, identifies significant system counters and events, then correlates them to failures and recovery. We apply EXERMEST to the Ranger HPC system cluster log data and show that it improves diagnosis over previous research. EXERMEST: (i) shows that more system counters and errors can be identified only by applying more feature extractors, (ii) identifies CPU I/O bottlenecks and Lustre client eviction, (iii) identifies network packet drops and Lustre I/O errors, (iv) identifies virtual memory and hard disk I/O errors, and (v) shows that time bins of different granularities are required for identifying the errors. EXERMEST is available in the public domain to support system administrators in failure diagnosis.
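
The abstract's point about time-bin granularity can be illustrated with a minimal sketch. This is not the EXERMEST implementation: the function names, the sample timestamps, and the choice of Pearson correlation are all assumptions made for illustration only. The sketch counts events per fixed-width time bin and correlates a resource counter series with a failure series, so the same data can be compared at a coarse and a fine bin width.

```python
from collections import Counter
from math import sqrt


def bin_counts(timestamps, bin_seconds, n_bins):
    """Count events per fixed-width time bin.

    timestamps are seconds measured from the start of the observation
    window; events that fall outside the n_bins bins are ignored.
    """
    counts = Counter(int(t // bin_seconds) for t in timestamps)
    return [counts.get(i, 0) for i in range(n_bins)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical event times (seconds) in a 120-second window.
    packet_drops = [5, 12, 61, 65, 70, 118]
    failures = [62, 68, 119]

    # Compare a coarse (60 s) and a fine (10 s) granularity.
    for bin_seconds in (60, 10):
        n_bins = 120 // bin_seconds
        drops = bin_counts(packet_drops, bin_seconds, n_bins)
        fails = bin_counts(failures, bin_seconds, n_bins)
        print(bin_seconds, drops, fails, pearson(drops, fails))
```

With only two coarse bins the correlation is trivially perfect, while the finer bins show which intervals actually line up; this is one way to see why a single bin width may not be enough.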

Original language	English
Title of host publication	2019 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)
Publisher	IEEE Explore
Pages	458-467
Number of pages	10
DOIs	https://doi.org/10.1109/ispa-bdcloud-sustaincom-socialcom48970.2019.00072
Publication status	Published - 18 Dec 2019

Bibliographical note

Acknowledgements: The Ranger cluster log data was provided by the Texas Advanced Computing Center (TACC). We would like to thank Karl Solchenbach (Intel) for granting access to his engineers. We thank the anonymous reviewers for their constructive feedback, which helped improve the paper significantly. This research is supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1, The Alan Turing Institute-Intel partnership, The University of Warwick Department of Computer Science scholarship, and The National Science Foundation under OCI awards #0622780, #1203604 and #1134872 to TACC at The University of Texas at Austin.

Access to Document

10.1109/ispa-bdcloud-sustaincom-socialcom48970.2019.00072

Cite this

Chuah, T., Jhumka, A., Alt, S., Villalobos, JJ., Fryman, J., Barth, B., & Parashar, M. (2019). Using Resource Use Data and System Logs for HPC System Error Propagation and Recovery Diagnosis. In 2019 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA) (pp. 458-467). IEEE Explore. https://doi.org/10.1109/ispa-bdcloud-sustaincom-socialcom48970.2019.00072

Using Resource Use Data and System Logs for HPC System Error Propagation and Recovery Diagnosis. / Chuah, Thuan; Jhumka, Arshad; Alt, Samantha et al.
2019 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA). IEEE Explore, 2019. p. 458-467.

Chuah, T, Jhumka, A, Alt, S, Villalobos, JJ, Fryman, J, Barth, B & Parashar, M 2019, Using Resource Use Data and System Logs for HPC System Error Propagation and Recovery Diagnosis. in 2019 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA). IEEE Explore, pp. 458-467. https://doi.org/10.1109/ispa-bdcloud-sustaincom-socialcom48970.2019.00072

@inproceedings{6c0925332bf546a7b4d5388ec24852fc,
title = "Using Resource Use Data and System Logs for HPC System Error Propagation and Recovery Diagnosis",
abstract = "Analyzing failures is important for the reliability of HPC systems, and failure diagnosis based only on system logs is incomplete. Resource use data, made available recently, is another potential source of data for failure analysis. Recent work that combines analysis of system logs with resource use data shows promising results. In this paper, we describe a new workflow for combining system resource usage and failure logs for diagnosis. The workflow, called EXERMEST, identifies significant system counters and events, then correlates them to failures and recovery. We apply EXERMEST to the Ranger HPC system cluster log data and show that it improves diagnosis over previous research. EXERMEST: (i) shows that more system counters and errors can be identified only by applying more feature extractors, (ii) identifies CPU I/O bottlenecks and Lustre client eviction, (iii) identifies network packet drops and Lustre I/O errors, (iv) identifies virtual memory and hard disk I/O errors, and (v) shows that time bins of different granularities are required for identifying the errors. EXERMEST is available in the public domain to support system administrators in failure diagnosis.",
author = "Thuan Chuah and Arshad Jhumka and Samantha Alt and JJ Villalobos and Josh Fryman and Bill Barth and Manish Parashar",
note = "Acknowledgements: The Ranger cluster log data was provided by the Texas Advanced Computing Center (TACC). We would like to thank Karl Solchenbach (Intel) for granting access to his engineers. We thank the anonymous reviewers for their constructive feedback, which helped improve the paper significantly. This research is supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1, The Alan Turing Institute-Intel partnership, The University of Warwick Department of Computer Science scholarship, and The National Science Foundation under OCI awards \#0622780, \#1203604 and \#1134872 to TACC at The University of Texas at Austin.",
year = "2019",
month = dec,
day = "18",
doi = "10.1109/ispa-bdcloud-sustaincom-socialcom48970.2019.00072",
language = "English",
pages = "458--467",
booktitle = "2019 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)",
publisher = "IEEE Explore",
}

TY  - GEN
T1  - Using Resource Use Data and System Logs for HPC System Error Propagation and Recovery Diagnosis
AU  - Chuah, Thuan
AU  - Jhumka, Arshad
AU  - Alt, Samantha
AU  - Villalobos, JJ
AU  - Fryman, Josh
AU  - Barth, Bill
AU  - Parashar, Manish
N1  - Acknowledgements: The Ranger cluster log data was provided by the Texas Advanced Computing Center (TACC). We would like to thank Karl Solchenbach (Intel) for granting access to his engineers. We thank the anonymous reviewers for their constructive feedback, which helped improve the paper significantly. This research is supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1, The Alan Turing Institute-Intel partnership, The University of Warwick Department of Computer Science scholarship, and The National Science Foundation under OCI awards #0622780, #1203604 and #1134872 to TACC at The University of Texas at Austin.
PY  - 2019/12/18
Y1  - 2019/12/18
AB  - Analyzing failures is important for the reliability of HPC systems, and failure diagnosis based only on system logs is incomplete. Resource use data, made available recently, is another potential source of data for failure analysis. Recent work that combines analysis of system logs with resource use data shows promising results. In this paper, we describe a new workflow for combining system resource usage and failure logs for diagnosis. The workflow, called EXERMEST, identifies significant system counters and events, then correlates them to failures and recovery. We apply EXERMEST to the Ranger HPC system cluster log data and show that it improves diagnosis over previous research. EXERMEST: (i) shows that more system counters and errors can be identified only by applying more feature extractors, (ii) identifies CPU I/O bottlenecks and Lustre client eviction, (iii) identifies network packet drops and Lustre I/O errors, (iv) identifies virtual memory and hard disk I/O errors, and (v) shows that time bins of different granularities are required for identifying the errors. EXERMEST is available in the public domain to support system administrators in failure diagnosis.
UR  - http://dx.doi.org/10.1109/ispa-bdcloud-sustaincom-socialcom48970.2019.00072
U2  - 10.1109/ispa-bdcloud-sustaincom-socialcom48970.2019.00072
DO  - 10.1109/ispa-bdcloud-sustaincom-socialcom48970.2019.00072
M3  - Published conference contribution
SP  - 458
EP  - 467
BT  - 2019 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)
PB  - IEEE Explore
ER  -
