Abstract
Recovering from processor failures is an important problem in the design of reliable distributed systems. Processes must coordinate their operation to guarantee that the local checkpoints taken by the individual processes form a consistent global checkpoint (recovery line). This allows the system to resume operation from a consistent global state when recovering from a failure. This paper presents the results of implementing a transparent (no special requirements on applications) and coordinated (non-blocking) distributed rollback-recovery algorithm. Because it does not block applications, overhead during failure-free operation is reduced. Furthermore, the rollback procedure can be executed quickly, since a recovery line is always available and well identified. Our preliminary experimental results show that the algorithm imposes very low performance overhead (less than 2%) and a strong dependency on the checkpoint size. We are now studying optimizations to the implementation to reduce checkpoint latency.
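To make the notion of a consistent global checkpoint concrete, the following minimal sketch tests whether a set of local checkpoints contains an orphan message, that is, a message recorded as received by one process but not recorded as sent by its sender. It illustrates the general definition only and is not the algorithm implemented in the paper. The `LocalCheckpoint` class, its per-peer counters, and the function names are assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LocalCheckpoint:
    """State saved by one process (hypothetical bookkeeping for illustration)."""
    pid: int
    sent: Dict[int, int] = field(default_factory=dict)      # peer pid -> messages sent so far
    received: Dict[int, int] = field(default_factory=dict)  # peer pid -> messages received so far


def orphan_messages(checkpoints: List[LocalCheckpoint]) -> List[Tuple[int, int]]:
    """Return (sender, receiver) pairs where more messages were received than sent."""
    by_pid = {c.pid: c for c in checkpoints}
    orphans = []
    for receiver in checkpoints:
        for sender_pid, n_received in receiver.received.items():
            sender = by_pid.get(sender_pid)
            # A sender missing from the set has recorded no sends at all.
            n_sent = sender.sent.get(receiver.pid, 0) if sender else 0
            if n_received > n_sent:
                orphans.append((sender_pid, receiver.pid))
    return orphans


def is_recovery_line(checkpoints: List[LocalCheckpoint]) -> bool:
    """A set of local checkpoints is consistent when it has no orphan messages."""
    return not orphan_messages(checkpoints)


# Process 0 sent two messages and process 1 has received one: the second
# message is still in transit, which does not break consistency.
consistent = [
    LocalCheckpoint(0, sent={1: 2}),
    LocalCheckpoint(1, received={0: 1}),
]

# Process 1 recorded a receive that process 0 never recorded sending: after a
# rollback to these checkpoints, that message would be an orphan.
inconsistent = [
    LocalCheckpoint(0, sent={1: 1}),
    LocalCheckpoint(1, received={0: 2}),
]

print(is_recovery_line(consistent))
print(is_recovery_line(inconsistent))
```

Counting messages per channel is only one possible way to detect orphans. Coordinated protocols typically avoid the need for such a check by construction, since the coordination itself ensures that the saved checkpoints form a recovery line.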
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Buligon, C., Cechin, S., Jansch-Pôrto, I. (2005). Implementing Rollback-Recovery Coordinated Checkpoints. In: Ramos, F.F., Larios Rosillo, V., Unger, H. (eds) Advanced Distributed Systems. ISSADS 2005. Lecture Notes in Computer Science, vol 3563. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11533962_22
DOI: https://doi.org/10.1007/11533962_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28063-7
Online ISBN: 978-3-540-31674-9
eBook Packages: Computer Science, Computer Science (R0)