High-Performance Computing (HPC) systems conduct extensive logging of resource usage data and system logs, and parsing this data is an often advocated basis for failure diagnosis. Major page faults are known to be one of the most common cause of performance problems in large cluster systems. We conduct an empirical study of major page faults on two large cluster systems. We set up three regression algorithms including the LASSO, Ridge and Elastic Net regression techniques. To the best of our knowledge, there is no work that studied different regression models to diagnose major page faults in a large cluster system. In this paper, we (a) propose an approach for diagnosing major page faults, and (b) evaluate the LASSO, Ridge and Elastic Net regression algorithms on real resource use data and system logs. As part of our contributions, we (a) compare the accuracy of the three regression algorithms, (b) identify the resource use counters which are correlated to major page faults and the system events which are correlated to page fault events, and (c) provide insights into major page faults and page fault events. Our work highlights empirical observations that could facilitate better handling of node failures in cluster systems.
We would like to thank the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing the resource use data and system logs from their HPC systems. We would also like to thank the anonymous reviewers for their constructive feedback, which helped improve our paper significantly.
No funding was received to assist with the preparation of this manuscript.
Data Availability StatementData availability
The datasets analyzed during this study are available from the corresponding author on request.
- large cluster systems
- major page faults
- system logs
- resource use data
- Regression Analysis