Link to original content: https://unpaywall.org/10.1007/S11227-010-0431-1

An implementation of a replicated file server supporting the crash-recovery failure model

The Journal of Supercomputing

Abstract

Data replication techniques are widely used to improve availability in software applications. Replicated systems have traditionally assumed the fail-stop model, which limits fault tolerance. There is therefore strong motivation to adopt the crash-recovery model, in which replicas can dynamically leave and join the system. To highlight some key issues that must be considered when dealing with replication and recovery, we have implemented a replicated file server that satisfies the crash-recovery model, making use of a Group Communication System. Our experiments yield two main results: the type of replication and the number of replicas must be chosen carefully, especially in update-intensive scenarios; and the recovery protocol imposes a variable overhead on the system. Given the latter, it would be convenient to tune the trade-off between recovery time and system throughput according to the service state size and the number of missed operations.


Author information

Corresponding author

Correspondence to José Enrique Armendáriz-Iñigo.

About this article

Cite this article

Arrieta-Salinas, I., Armendáriz-Iñigo, J.E., Juárez-Rodríguez, J.R. et al. An implementation of a replicated file server supporting the crash-recovery failure model. J Supercomput 59, 156–202 (2012). https://doi.org/10.1007/s11227-010-0431-1

  • DOI: https://doi.org/10.1007/s11227-010-0431-1
