An Empirical Study of Major Page Faults for Failure Diagnosis in Cluster Systems

Edward Chuah* (Corresponding Author), Arshad Jhumka, Sai Narasimhamurthy

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

High-Performance Computing (HPC) systems conduct extensive logging of resource usage data and system logs, and parsing this data is an often-advocated basis for failure diagnosis. Major page faults are known to be one of the most common causes of performance problems in large cluster systems. We conduct an empirical study of major page faults on two large cluster systems, using three regression algorithms: LASSO, Ridge and Elastic Net. To the best of our knowledge, no prior work has studied different regression models for diagnosing major page faults in a large cluster system. In this paper, we (a) propose an approach for diagnosing major page faults, and (b) evaluate the LASSO, Ridge and Elastic Net regression algorithms on real resource use data and system logs. As part of our contributions, we (a) compare the accuracy of the three regression algorithms, (b) identify the resource use counters that are correlated with major page faults and the system events that are correlated with page fault events, and (c) provide insights into major page faults and page fault events. Our work highlights empirical observations that could facilitate better handling of node failures in cluster systems.
Original language: English
Pages (from-to): 18445-18479
Number of pages: 35
Journal: Journal of Supercomputing
Volume: 79
Early online date: 15 May 2023
Publication status: Published - Nov 2023

Bibliographical note

Acknowledgements
We would like to thank the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing the resource use data and system logs from their HPC systems. We would also like to thank the anonymous reviewers for their constructive feedback, which helped improve our paper significantly.

Funding
No funding was received to assist with the preparation of this manuscript.

Data Availability Statement

The datasets analyzed during this study are available from the corresponding author on request.

Keywords

  • large cluster systems
  • major page faults
  • system logs
  • resource use data
  • regression analysis
