Abstract
Computational power grids are computing environments with massive resources for processing and storage. While these resources may be pervasive, harnessing them is a major challenge for the average user. NetSolve is a software environment that addresses this concern. A fundamental feature of NetSolve is its integration of fault-tolerance and task migration in a way that is transparent to the end user. In this paper, we discuss how NetSolve’s structure allows for the seamless integration of fault-tolerance and migration in grid applications, and present the specific approaches that have been and are currently being implemented within NetSolve.
This work was supported by the Applied Mathematical Sciences Research Program, Office of Energy Research, U.S. Department of Energy, under contract DE-AL04-94AL85000 with Lockheed Martin Energy Research Corporation.
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Plank, J.S., Casanova, H., Beck, M., Dongarra, J. (1998). Deploying fault-tolerance and task migration with NetSolve. In: Kågström, B., Dongarra, J., Elmroth, E., Waśniewski, J. (eds) Applied Parallel Computing Large Scale Scientific and Industrial Problems. PARA 1998. Lecture Notes in Computer Science, vol 1541. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0095364
DOI: https://doi.org/10.1007/BFb0095364
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65414-8
Online ISBN: 978-3-540-49261-0
eBook Packages: Springer Book Archive