Deploying fault-tolerance and task migration with NetSolve

Plank, James S.; Casanova, Henri; Beck, Micah; Dongarra, Jack

doi:10.1007/BFb0095364

James S. Plank¹,
Henri Casanova¹,
Micah Beck¹ &
Jack Dongarra¹,²

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1541)

Included in the following conference series:

International Workshop on Applied Parallel Computing

Abstract

Computational power grids are computing environments with massive resources for processing and storage. While these resources may be pervasive, harnessing them is a major challenge for the average user. NetSolve is a software environment that addresses this concern. A fundamental feature of NetSolve is its integration of fault-tolerance and task migration in a way that is transparent to the end user. In this paper, we discuss how NetSolve’s structure allows for the seamless integration of fault-tolerance and migration in grid applications, and present the specific approaches that have been and are currently being implemented within NetSolve.

This work was supported by the Applied Mathematical Sciences Research Program, Office of Energy Research, U.S. Department of Energy, under contract DE-AL04-94AL85000 with Lockheed Martin Energy Research Corporation.

Author information

Authors and Affiliations

Computing Sciences Department, University of Tennessee, 37996, Knoxville, TN
James S. Plank, Henri Casanova, Micah Beck & Jack Dongarra
Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831-6367, Oak Ridge, TN
Jack Dongarra

Editor information

Bo Kågström, Jack Dongarra, Erik Elmroth, Jerzy Waśniewski

About this paper

Cite this paper

Plank, J.S., Casanova, H., Beck, M., Dongarra, J. (1998). Deploying fault-tolerance and task migration with NetSolve. In: Kågström, B., Dongarra, J., Elmroth, E., Waśniewski, J. (eds) Applied Parallel Computing Large Scale Scientific and Industrial Problems. PARA 1998. Lecture Notes in Computer Science, vol 1541. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0095364

DOI: https://doi.org/10.1007/BFb0095364
Published: 20 October 2006
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65414-8
Online ISBN: 978-3-540-49261-0
eBook Packages: Springer Book Archive
