Abstract
Large supercomputers are composed of numerous components that can break down or behave in unwanted ways. Identifying broken components is a daunting task for system administrators, so an automated tool would be a boon for system resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach that combines holistic data centre monitoring, node status labels assigned by system administrators, and an explainable model for fault detection in supercomputing nodes. The proposed model classifies the different states of the computing nodes using labeled data describing the supercomputer's behaviour; such data is typically collected by system administrators but not integrated into the holistic monitoring infrastructure used for data centre automation. Compared with other methods, the proposed one is robust and provides explainable predictions. The model has been trained and validated on data gathered from a tier-0 supercomputer in production.
Supported by University of Bologna and CINECA, Italy.
Change history
04 January 2022
The chapter was inadvertently published with the spelling error in the first author’s name. It has been corrected to “Martin Molan”.
Notes
1. April-July 2019.
2. Timestamps uniquely identify rows/examples.
3. Value 1 is unused.
4. Hyperparameters: DT: max depth none, splitting heuristic Gini impurity, min samples per leaf 1; L-SVM: loss function squared hinge, regularization l2; NN: 5 neighbours, uniform weights, Euclidean metric; RBF-SVM: RBF kernel, regularization parameter 1; RF: 10 estimators, base estimator parameters same as DT.
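The hyperparameters in the last note can be written down as a small configuration. The following is a minimal sketch, assuming the five classifiers are scikit-learn estimators; the note does not name the library, so the class names and keyword arguments below are an assumed mapping, and `HYPERPARAMS`, `DT_PARAMS` and `build_models` are hypothetical names introduced only for illustration. "NN" is read here as a nearest-neighbours classifier, since the note gives a number of neighbours.

```python
# Hyperparameters from the note, keyed by the abbreviations used there.
# Mapping them onto scikit-learn keyword arguments is an assumption.
DT_PARAMS = {"max_depth": None, "criterion": "gini", "min_samples_leaf": 1}

HYPERPARAMS = {
    "DT": dict(DT_PARAMS),
    "L-SVM": {"loss": "squared_hinge", "penalty": "l2"},
    "NN": {"n_neighbors": 5, "weights": "uniform", "metric": "euclidean"},
    "RBF-SVM": {"kernel": "rbf", "C": 1.0},
    # The random forest reuses the decision tree settings for its base trees.
    "RF": {"n_estimators": 10, **DT_PARAMS},
}


def build_models():
    """Instantiate one unfitted scikit-learn classifier per HYPERPARAMS entry."""
    # Imported lazily so the configuration can be inspected without scikit-learn.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC, LinearSVC
    from sklearn.tree import DecisionTreeClassifier

    classes = {
        "DT": DecisionTreeClassifier,
        "L-SVM": LinearSVC,
        "NN": KNeighborsClassifier,
        "RBF-SVM": SVC,
        "RF": RandomForestClassifier,
    }
    return {name: classes[name](**HYPERPARAMS[name]) for name in classes}
```

Calling `build_models()` returns a dictionary of unfitted estimators, each of which could then be trained with its `fit(X, y)` method on labeled node data. Keeping the values in a plain dictionary makes it easy to compare them against the note.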
Acknowledgements
This research was partly supported by the EU H2020-ICT-11–2018-2019 IoTwins project (g.a. 857191), the H2020-JTI-EuroHPC-2019–1 Regale project (g.a. 956560) and Emilia-Romagna POR-FESR 2014–2020 project “SUPER: SuperComputing Unifier Platform - Emilia-Romagna”.
We also thank CINECA for the collaboration and access to their machines.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Molan, M., Borghesi, A., Beneventi, F., Guarrasi, M., Bartolini, A. (2021). An Explainable Model for Fault Detection in HPC Systems. In: Jagode, H., Anzt, H., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12761. Springer, Cham. https://doi.org/10.1007/978-3-030-90539-2_25
DOI: https://doi.org/10.1007/978-3-030-90539-2_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90538-5
Online ISBN: 978-3-030-90539-2
eBook Packages: Computer Science (R0)