Link to original content: https://unpaywall.org/10.1007/978-3-030-90539-2_25

An Explainable Model for Fault Detection in HPC Systems

  • Conference paper
High Performance Computing (ISC High Performance 2021)

Abstract

Large supercomputers are composed of numerous components that can break down or behave in unwanted ways. Identifying broken components is a daunting task for system administrators, so an automated tool would be a boon for the system's resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach that combines holistic data centre monitoring, node status labels provided by system administrators, and an explainable model for fault detection in supercomputing nodes. The proposed model classifies the different states of the computing nodes using labeled data describing the supercomputer's behaviour. Such data is typically collected by system administrators but is not integrated into the holistic monitoring infrastructure used for data centre automation. Compared with other methods, the one proposed here is robust and provides explainable predictions. The model has been trained and validated on data gathered from a tier-0 supercomputer in production.

Supported by University of Bologna and CINECA, Italy.

Change history

  • 04 January 2022

    The chapter was inadvertently published with a spelling error in the first author’s name. It has been corrected to “Martin Molan”.

Notes

  1. April-July 2019.

  2. Timestamps uniquely identify rows/examples.

  3. Value 1 is unused.

  4. Hyperparameters: DT: max depth none, splitting heuristic Gini impurity, min samples leaf 1; L-SVM: squared hinge loss, l2 regularization; NN: 5 neighbours, uniform weights, Euclidean metric; RBF-SVM: RBF kernel, regularization parameter 1; RF: 10 estimators, base estimator parameters same as DT.
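
The hyperparameters in the last note can be written out as classifier configurations. The following is a minimal sketch that assumes the baselines are scikit-learn estimators and that "NN" denotes a k-nearest-neighbours classifier; neither is confirmed by the text here. The helper name `build_baselines` and the constant `TREE_PARAMS` are hypothetical, introduced only for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Decision-tree settings from the note: no depth limit, Gini impurity,
# at least one sample per leaf. Reused for the random forest's trees.
TREE_PARAMS = {"max_depth": None, "criterion": "gini", "min_samples_leaf": 1}


def build_baselines():
    """Return the five baseline classifiers keyed by the note's abbreviations.

    Assumption: scikit-learn estimators; "NN" is read as k-nearest neighbours.
    """
    return {
        "DT": DecisionTreeClassifier(**TREE_PARAMS),
        "L-SVM": LinearSVC(loss="squared_hinge", penalty="l2"),
        "NN": KNeighborsClassifier(
            n_neighbors=5, weights="uniform", metric="euclidean"
        ),
        "RBF-SVM": SVC(kernel="rbf", C=1.0),
        "RF": RandomForestClassifier(n_estimators=10, **TREE_PARAMS),
    }


if __name__ == "__main__":
    # List each abbreviation next to the estimator class it maps to.
    for name, model in build_baselines().items():
        print(name, type(model).__name__)
```

Each estimator can then be trained with `fit(X, y)` on the labeled node data. Several of these values coincide with scikit-learn defaults; spelling them out keeps the configuration readable next to the note.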

Acknowledgements

This research was partly supported by the EU H2020-ICT-11–2018-2019 IoTwins project (g.a. 857191), the H2020-JTI-EuroHPC-2019–1 Regale project (g.a. 956560) and the Emilia-Romagna POR-FESR 2014–2020 project “SUPER: SuperComputing Unifier Platform - Emilia-Romagna”. We also thank CINECA for the collaboration and access to their machines.

Author information

Correspondence to Martin Molan.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Molan, M., Borghesi, A., Beneventi, F., Guarrasi, M., Bartolini, A. (2021). An Explainable Model for Fault Detection in HPC Systems. In: Jagode, H., Anzt, H., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12761. Springer, Cham. https://doi.org/10.1007/978-3-030-90539-2_25

  • DOI: https://doi.org/10.1007/978-3-030-90539-2_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-90538-5

  • Online ISBN: 978-3-030-90539-2

  • eBook Packages: Computer Science, Computer Science (R0)
