An Explainable Model for Fault Detection in HPC Systems

Molan, Martin; Borghesi, Andrea; Beneventi, Francesco; Guarrasi, Massimiliano; Bartolini, Andrea

doi:10.1007/978-3-030-90539-2_25

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12761)

Included in the following conference series: International Conference on High Performance Computing

The original version of this chapter was revised: The spelling of the first author’s name was corrected. The correction to this chapter is available at https://doi.org/10.1007/978-3-030-90539-2_36

Abstract

Large supercomputers are composed of numerous components that can break down or behave in unwanted ways. Identifying broken components is a daunting task for system administrators, so an automated tool would be a boon for system resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach that combines holistic data centre monitoring, node status labels assigned by system administrators, and an explainable model for fault detection in supercomputing nodes. The proposed model classifies the different states of the computing nodes using labeled data describing the supercomputer's behaviour; such data is typically collected by system administrators but not integrated into the holistic monitoring infrastructure used for data centre automation. In comparison with other methods, the one proposed here is robust and provides explainable predictions. The model has been trained and validated on data gathered from a tier-0 supercomputer in production.

Supported by University of Bologna and CINECA, Italy.

Change history

04 January 2022
The chapter was inadvertently published with a spelling error in the first author’s name. It has been corrected to “Martin Molan”.

Notes

1. April-July 2019.
2. Timestamps uniquely identify rows/examples.
3. Value 1 is unused.
4. Hyperparameters: DT: max depth equal to none, splitting heuristic Gini impurity, min samples leaf equal to 1; L-SVM: loss function squared hinge, regularization l2; NN: number of neighbours equal to 5, uniform weights, Euclidean metric; RBF-SVM: RBF kernel, regularization parameter equal to 1; RF: number of estimators equal to 10, base estimator parameters same as DT.
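
As an illustration of note 4, the listed settings can be written as classifier constructors. This is a minimal sketch that assumes the scikit-learn library (the note itself does not name a library). The `build_classifiers` helper and the `random_state` seed are hypothetical additions for reproducibility, and NN is read here as a nearest-neighbours classifier because the note gives a number of neighbours.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier


def build_classifiers(random_state=0):
    """Return the five baseline classifiers configured as in note 4.

    `random_state` is an assumption added for reproducibility; it is not
    part of the hyperparameters listed in the note.
    """
    # DT settings, reused for the RF base estimators as the note states.
    tree_params = {"max_depth": None, "criterion": "gini", "min_samples_leaf": 1}
    return {
        "DT": DecisionTreeClassifier(random_state=random_state, **tree_params),
        "L-SVM": LinearSVC(loss="squared_hinge", penalty="l2",
                           random_state=random_state),
        "NN": KNeighborsClassifier(n_neighbors=5, weights="uniform",
                                   metric="euclidean"),
        "RBF-SVM": SVC(kernel="rbf", C=1.0, random_state=random_state),
        "RF": RandomForestClassifier(n_estimators=10,
                                     random_state=random_state, **tree_params),
    }


classifiers = build_classifiers()
for name, clf in classifiers.items():
    # Print each label from the note next to the estimator class used here.
    print(name, type(clf).__name__)
```

Each estimator can then be trained with `fit(X, y)` on a feature matrix and node-state labels. The data layout is not described in this note, so it is left out of the sketch.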

Acknowledgements

This research was partly supported by the EU H2020-ICT-11–2018-2019 IoTwins project (g.a. 857191), the H2020-JTI-EuroHPC-2019–1 Regale project (g.a. 956560) and Emilia-Romagna POR-FESR 2014–2020 project “SUPER: SuperComputing Unifier Platform - Emilia-Romagna”.

We also thank CINECA for the collaboration and access to their machines.

Author information

Authors and Affiliations

University of Bologna, Bologna, Italy
Martin Molan, Andrea Borghesi, Francesco Beneventi & Andrea Bartolini
CINECA, Reno, Italy
Massimiliano Guarrasi

Corresponding author

Correspondence to Martin Molan.

Editor information

Editors and Affiliations

University of Tennessee at Knoxville, Knoxville, TN, USA
Heike Jagode
Karlsruhe Institute of Technology, Karlsruhe, Baden-Württemberg, Germany
Hartwig Anzt
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Hatem Ltaief
University of Tennessee System, Knoxville, TN, USA
Piotr Luszczek

About this paper

Cite this paper

Molan, M., Borghesi, A., Beneventi, F., Guarrasi, M., Bartolini, A. (2021). An Explainable Model for Fault Detection in HPC Systems. In: Jagode, H., Anzt, H., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12761. Springer, Cham. https://doi.org/10.1007/978-3-030-90539-2_25

DOI: https://doi.org/10.1007/978-3-030-90539-2_25
Published: 13 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90538-5
Online ISBN: 978-3-030-90539-2
eBook Packages: Computer Science, Computer Science (R0)
