An Empirical Study of Major Page Faults for Failure Diagnosis in Cluster Systems

Edward Chuah; Arshad Jhumka; Sai Narasimhamurthy

doi:10.1007/s11227-023-05366-1

Edward Chuah^* (Corresponding Author), Arshad Jhumka, Sai Narasimhamurthy

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

High-Performance Computing (HPC) systems conduct extensive logging of resource usage data and system logs, and parsing this data is an often-advocated basis for failure diagnosis. Major page faults are known to be one of the most common causes of performance problems in large cluster systems. We conduct an empirical study of major page faults on two large cluster systems. We set up three regression algorithms: LASSO, Ridge and Elastic Net. To the best of our knowledge, no prior work has studied different regression models to diagnose major page faults in a large cluster system. In this paper, we (a) propose an approach for diagnosing major page faults, and (b) evaluate the LASSO, Ridge and Elastic Net regression algorithms on real resource use data and system logs. As part of our contributions, we (a) compare the accuracy of the three regression algorithms, (b) identify the resource use counters which are correlated with major page faults and the system events which are correlated with page fault events, and (c) provide insights into major page faults and page fault events. Our work highlights empirical observations that could facilitate better handling of node failures in cluster systems.

Original language	English
Pages (from-to)	18445-18479
Number of pages	35
Journal	Journal of Supercomputing
Volume	79
Early online date	15 May 2023
DOIs	https://doi.org/10.1007/s11227-023-05366-1
Publication status	Published - Nov 2023

Bibliographical note

Acknowledgements
We would like to thank the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing the resource use data and system logs from their HPC systems. We would also like to thank the anonymous reviewers for their constructive feedback, which helped improve our paper significantly.

Funding
No funding was received to assist with the preparation of this manuscript.

Data Availability Statement

Data availability
The datasets analyzed during this study are available from the corresponding author on request.

Keywords

large cluster systems
major page faults
system logs
resource use data
Regression Analysis

Access to Document

10.1007/s11227-023-05366-1
Licence: Unspecified

Embargoed Document

Chuah_etal_JoSC_An_Empirical_Study_AAM
Accepted author manuscript, 963 KB
Licence: Other
Embargo ends: 15/05/24

Cite this

@article{285db425e60d469a81ef2baeadc0f818,

title = "An Empirical Study of Major Page Faults for Failure Diagnosis in Cluster Systems",

abstract = "High-Performance Computing (HPC) systems conduct extensive logging of resource usage data and system logs, and parsing this data is an often-advocated basis for failure diagnosis. Major page faults are known to be one of the most common causes of performance problems in large cluster systems. We conduct an empirical study of major page faults on two large cluster systems. We set up three regression algorithms: LASSO, Ridge and Elastic Net. To the best of our knowledge, no prior work has studied different regression models to diagnose major page faults in a large cluster system. In this paper, we (a) propose an approach for diagnosing major page faults, and (b) evaluate the LASSO, Ridge and Elastic Net regression algorithms on real resource use data and system logs. As part of our contributions, we (a) compare the accuracy of the three regression algorithms, (b) identify the resource use counters which are correlated with major page faults and the system events which are correlated with page fault events, and (c) provide insights into major page faults and page fault events. Our work highlights empirical observations that could facilitate better handling of node failures in cluster systems.",

keywords = "large cluster systems, major page faults, system logs, resource use data, Regression Analysis",

author = "Edward Chuah and Arshad Jhumka and Sai Narasimhamurthy",

note = "Acknowledgements: We would like to thank the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing the resource use data and system logs from their HPC systems. We would also like to thank the anonymous reviewers for their constructive feedback, which helped improve our paper significantly. Funding: No funding was received to assist with the preparation of this manuscript.",

year = "2023",

month = nov,

doi = "10.1007/s11227-023-05366-1",

language = "English",

volume = "79",

pages = "18445--18479",

journal = "Journal of Supercomputing",

issn = "0920-8542",

publisher = "Springer Netherlands",

}

TY - JOUR

T1 - An Empirical Study of Major Page Faults for Failure Diagnosis in Cluster Systems

AU - Chuah, Edward

AU - Jhumka, Arshad

AU - Narasimhamurthy, Sai

N1 - Acknowledgements: We would like to thank the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing the resource use data and system logs from their HPC systems. We would also like to thank the anonymous reviewers for their constructive feedback, which helped improve our paper significantly. Funding: No funding was received to assist with the preparation of this manuscript.

PY - 2023/11

Y1 - 2023/11

N2 - High-Performance Computing (HPC) systems conduct extensive logging of resource usage data and system logs, and parsing this data is an often-advocated basis for failure diagnosis. Major page faults are known to be one of the most common causes of performance problems in large cluster systems. We conduct an empirical study of major page faults on two large cluster systems. We set up three regression algorithms: LASSO, Ridge and Elastic Net. To the best of our knowledge, no prior work has studied different regression models to diagnose major page faults in a large cluster system. In this paper, we (a) propose an approach for diagnosing major page faults, and (b) evaluate the LASSO, Ridge and Elastic Net regression algorithms on real resource use data and system logs. As part of our contributions, we (a) compare the accuracy of the three regression algorithms, (b) identify the resource use counters which are correlated with major page faults and the system events which are correlated with page fault events, and (c) provide insights into major page faults and page fault events. Our work highlights empirical observations that could facilitate better handling of node failures in cluster systems.

AB - High-Performance Computing (HPC) systems conduct extensive logging of resource usage data and system logs, and parsing this data is an often-advocated basis for failure diagnosis. Major page faults are known to be one of the most common causes of performance problems in large cluster systems. We conduct an empirical study of major page faults on two large cluster systems. We set up three regression algorithms: LASSO, Ridge and Elastic Net. To the best of our knowledge, no prior work has studied different regression models to diagnose major page faults in a large cluster system. In this paper, we (a) propose an approach for diagnosing major page faults, and (b) evaluate the LASSO, Ridge and Elastic Net regression algorithms on real resource use data and system logs. As part of our contributions, we (a) compare the accuracy of the three regression algorithms, (b) identify the resource use counters which are correlated with major page faults and the system events which are correlated with page fault events, and (c) provide insights into major page faults and page fault events. Our work highlights empirical observations that could facilitate better handling of node failures in cluster systems.

KW - large cluster systems

KW - major page faults

KW - system logs

KW - resource use data

KW - Regression Analysis

U2 - 10.1007/s11227-023-05366-1

DO - 10.1007/s11227-023-05366-1

M3 - Article

SN - 0920-8542

VL - 79

SP - 18445

EP - 18479

JO - Journal of Supercomputing

JF - Journal of Supercomputing

ER -
