An implementation of a replicated file server supporting the crash-recovery failure model

Arrieta-Salinas, Itziar; Armendáriz-Iñigo, José Enrique; Juárez-Rodríguez, José Ramón; González de Mendívil, José Ramón

doi:10.1007/s11227-010-0431-1

Published: 14 April 2010

The Journal of Supercomputing, Volume 59, pages 156–202 (2012)

Abstract

Data replication techniques are widely used to improve the availability of software applications. Replicated systems have traditionally assumed the fail-stop model, which limits fault tolerance. There is therefore a strong motivation to adopt the crash-recovery model, in which replicas can dynamically leave and join the system. To point out some key issues that must be considered when dealing with replication and recovery, we have implemented a replicated file server that supports the crash-recovery model, making use of a Group Communication System. According to our experiments, the most interesting results are, first, that the type of replication and the number of replicas must be chosen carefully, especially in update-intensive scenarios; and second, that the recovery protocol imposes a variable overhead on the system. Given the latter, it would be convenient to adjust the desired trade-off between recovery time and system throughput according to the service state size and the number of missed operations.
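
The closing point of the abstract, that recovery cost depends on the service state size and on the number of missed operations, can be illustrated with a small sketch. It is not the authors' recovery protocol: it assumes a hypothetical donor replica that keeps a bounded log of totally ordered write operations and, for a rejoining replica, either replays the missed operations or transfers the whole state, whichever is estimated to move fewer bytes. `Replica`, `choose_recovery`, `recover`, `LOG_LIMIT` and the 64-byte per-operation overhead are all illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass, field

LOG_LIMIT = 100  # hypothetical number of operations a replica retains


@dataclass
class Replica:
    """Hypothetical file-server replica (illustrative, not the paper's code).

    State is a dict mapping file names to contents. Write operations are
    assumed to arrive in the same total order at every replica (as a Group
    Communication System could provide), numbered 1, 2, 3, ...
    """
    files: dict = field(default_factory=dict)
    last_seq: int = 0
    log: deque = field(default_factory=lambda: deque(maxlen=LOG_LIMIT))

    def apply(self, seq, op):
        name, data = op
        self.files[name] = data
        self.last_seq = seq
        self.log.append((seq, op))  # the oldest entry is dropped when full


def size_of(name, data):
    return len(name) + len(data)


def choose_recovery(donor, joiner_seq, per_op_overhead=64):
    """Pick 'none', 'replay' (send missed ops) or 'full' (send whole state)."""
    n_missed = donor.last_seq - joiner_seq
    if n_missed <= 0:
        return "none"
    missed = [(s, op) for s, op in donor.log if s > joiner_seq]
    if len(missed) < n_missed:
        return "full"  # the log was truncated, so replay is impossible
    # Replay cost grows with the number of missed operations ...
    replay_bytes = sum(size_of(*op) + per_op_overhead for _, op in missed)
    # ... while full transfer cost grows with the service state size.
    state_bytes = sum(size_of(n, d) for n, d in donor.files.items())
    return "replay" if replay_bytes < state_bytes else "full"


def recover(donor, joiner, per_op_overhead=64):
    """Bring `joiner` up to date from `donor`; return the strategy used."""
    strategy = choose_recovery(donor, joiner.last_seq, per_op_overhead)
    if strategy == "full":
        joiner.files = dict(donor.files)
        joiner.last_seq = donor.last_seq
        joiner.log = deque(donor.log, maxlen=LOG_LIMIT)
    elif strategy == "replay":
        for s, op in list(donor.log):
            if s > joiner.last_seq:
                joiner.apply(s, op)
    return strategy


if __name__ == "__main__":
    donor, joiner = Replica(), Replica()
    for seq in range(1, 4):              # both replicas apply ops 1..3
        op = (f"f{seq}", "x" * 1000)
        donor.apply(seq, op)
        joiner.apply(seq, op)
    donor.apply(4, ("f4", "y" * 10))     # joiner is down and misses op 4
    print(recover(donor, joiner))        # prints "replay"
```

A byte count is only one possible cost model; a real deployment would also weigh the time needed to apply each operation and the throughput lost while the donor is busy serving the recovering replica.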

Similar content being viewed by others

A New Data Replication Scheme for PVFS2

Hybrid Replication Schemes of Processes for Fault-Tolerance Systems in Energy-Efficient Server Clusters

Scalable Byzantine fault-tolerant state-machine replication on heterogeneous servers (Article, 21 August 2018)

Author information

Authors and Affiliations

Departamento de Ingeniería Matemática e Informática, Universidad Pública de Navarra, 31006, Pamplona, Spain
Itziar Arrieta-Salinas, José Enrique Armendáriz-Iñigo, José Ramón Juárez-Rodríguez & José Ramón González de Mendívil

Corresponding author

Correspondence to José Enrique Armendáriz-Iñigo.

About this article

Cite this article

Arrieta-Salinas, I., Armendáriz-Iñigo, J.E., Juárez-Rodríguez, J.R. et al. An implementation of a replicated file server supporting the crash-recovery failure model. J Supercomput 59, 156–202 (2012). https://doi.org/10.1007/s11227-010-0431-1

Published: 14 April 2010
Issue Date: January 2012
DOI: https://doi.org/10.1007/s11227-010-0431-1
