Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization

Bouteiller, Aurelien; Cappello, Franck; Dongarra, Jack; Guermouche, Amina; Hérault, Thomas; Robert, Yves

doi:10.1007/978-3-642-40047-6_43


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8097)

Included in the following conference series: European Conference on Parallel Processing

Abstract

Failures increasingly threaten the efficiency of HPC systems, and current projections for Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its limits at such scales. One reason for this unnerving situation is the focus that has been placed on per-application completion time rather than on platform efficiency. In this paper, we discuss the case of uncoordinated rollback recovery in which the idle time spent waiting for recovering processors is used to make progress on a different, independent application from the system batch queue. We then propose an extended model of uncoordinated checkpointing that can discriminate between idle time and wasted computation. We instantiate this model in a simulator to demonstrate that, with this strategy, the per-application completion time of uncoordinated checkpointing is unchanged, while the strategy delivers near-perfect platform efficiency.
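
The distinction the abstract draws between idle time and wasted computation can be illustrated with a small accounting sketch. This is not the paper's model or simulator: the `Breakdown` class, the `backfill_fraction` parameter, and all numbers are hypothetical, and the sketch ignores costs such as switching between applications. It only shows why time spent waiting can be recovered by another job, whereas re-executed work cannot.

```python
from dataclasses import dataclass


@dataclass
class Breakdown:
    """Hypothetical split of one processor's wall-clock time (arbitrary units)."""
    useful: float      # work that is kept
    checkpoint: float  # time spent writing checkpoints
    rework: float      # wasted computation: work redone after a rollback
    idle: float        # time spent waiting for a recovering processor

    @property
    def total(self) -> float:
        return self.useful + self.checkpoint + self.rework + self.idle


def application_efficiency(b: Breakdown) -> float:
    """Fraction of the time that advances the application itself."""
    return b.useful / b.total


def platform_efficiency(b: Breakdown, backfill_fraction: float) -> float:
    """Fraction of the time that advances any application.

    backfill_fraction is the (assumed) share of idle time that an
    independent job from the batch queue turns into useful work.
    """
    if not 0.0 <= backfill_fraction <= 1.0:
        raise ValueError("backfill_fraction must be in [0, 1]")
    return (b.useful + backfill_fraction * b.idle) / b.total


# Made-up numbers, for illustration only.
example = Breakdown(useful=80.0, checkpoint=5.0, rework=5.0, idle=10.0)
print(application_efficiency(example))
print(platform_efficiency(example, backfill_fraction=0.0))
print(platform_efficiency(example, backfill_fraction=1.0))
```

In this toy accounting, backfilling leaves the application's own efficiency untouched and raises only the platform-level figure; checkpoint and rework time stay lost in both views.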

Author information

Authors and Affiliations

University of Tennessee Knoxville, USA
Aurelien Bouteiller, Jack Dongarra, Thomas Hérault & Yves Robert
University of Illinois at Urbana Champaign, USA
Franck Cappello
INRIA, France
Franck Cappello
Univ. Versailles St Quentin, France
Amina Guermouche
Ecole Normale Supérieure de Lyon, France
Yves Robert

Editor information

Editors and Affiliations

German Research School for Simulation Sciences, RWTH Aachen, Schinkelstr. 2a, 52062, Aachen, Germany
Felix Wolf
Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH, Station 22, 52425, Jülich, Germany
Bernd Mohr
Center for Computing and Communication, RWTH Aachen, Seffenter Weg 23, 52074, Aachen, Germany
Dieter an Mey

Cite this paper

Bouteiller, A., Cappello, F., Dongarra, J., Guermouche, A., Hérault, T., Robert, Y. (2013). Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization. In: Wolf, F., Mohr, B., an Mey, D. (eds) Euro-Par 2013 Parallel Processing. Euro-Par 2013. Lecture Notes in Computer Science, vol 8097. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40047-6_43

DOI: https://doi.org/10.1007/978-3-642-40047-6_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40046-9
Online ISBN: 978-3-642-40047-6
eBook Packages: Computer Science, Computer Science (R0)
