A Straggler Identification Model for Large-Scale Distributed Computing Systems Using Machine Learning

Said, Samar A.; Habashy, Shahira M.; Salem, Sameh A.; Saad, E. L.-Sayed. M.

doi:10.1007/978-3-031-20601-6_10

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 152)

Included in the following conference series:

International Conference on Advanced Intelligent Systems and Informatics

Abstract

Large-scale distributed computing systems have become crucial for storing, processing, and analyzing massive datasets. Apache Spark provides a general and efficient programming model for large-scale data processing built on the Resilient Distributed Dataset (RDD). However, stragglers are one of the major issues in Spark clusters: a straggler is a task that takes an abnormally long time to finish execution, which degrades performance. In this paper, a straggler identification model for distributed environments using machine learning is proposed. The model uses several Spark parameters, extracted from the execution of large-scale jobs of various types, to help identify stragglers. In addition, it applies machine learning approaches to Spark logs to learn various kinds of job execution features. The model is evaluated on various real-world benchmark datasets using default Apache Spark across diverse CPU, I/O, and mixed workloads. The results show empirically that Logistic Regression outperforms the other competitive models, achieving an average accuracy of 90% for straggler identification.
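The general idea of log-based straggler classification can be illustrated with a minimal sketch. It is not the authors' implementation: the preview does not list the Spark parameters or the labeling rule used in the chapter, so the single feature (a normalized per-task metric), the 1.5 x median labeling threshold, and all function names below are assumptions made for illustration. The logistic regression is written with plain gradient descent so the block needs only the standard library.

```python
import math
from statistics import median


def label_stragglers(durations, factor=1.5):
    """Label a task 1 (straggler) if its duration exceeds factor x the median.

    The 1.5 x median rule is an assumed labeling heuristic, not taken from the chapter.
    """
    cutoff = factor * median(durations)
    return [1 if d > cutoff else 0 for d in durations]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this task: predicted probability minus label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (straggler) if the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


if __name__ == "__main__":
    # Hypothetical task durations in seconds; only the last task is far above the median.
    print(label_stragglers([10, 11, 9, 10, 30]))  # -> [0, 0, 0, 0, 1]

    # Hypothetical one-feature dataset (e.g. a normalized per-task metric from logs).
    X = [[0.1], [0.2], [0.15], [0.9], [0.8], [0.95]]
    y = [0, 0, 0, 1, 1, 1]
    w, b = train_logistic_regression(X, y)
    accuracy = sum(predict(w, b, x) == t for x, t in zip(X, y)) / len(y)
    print(f"training accuracy: {accuracy:.2f}")
```

In practice a library implementation with a held-out test set would be used instead of this hand-rolled trainer; the sketch only shows how labeled task features feed a binary classifier.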

Author information

Authors and Affiliations

Department of Computer and Systems Engineering, Faculty of Engineering, Helwan University, Cairo, Egypt
Samar A. Said, Shahira M. Habashy, Sameh A. Salem & E. L.-Sayed. M. Saad

Corresponding author

Correspondence to Samar A. Said.

Editor information

Editors and Affiliations

Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt
Aboul Ella Hassanien
Faculty of Electrical Engineering and Computer Science, VŠB-Technical University of Ostrava, Ostrava-Poruba, Moravskoslezsky, Czech Republic
Václav Snášel
International Center for Informatics Research, Beijing Jiaotong University, Beijing, China
Mincong Tang
College of Computer Science and Mathematics, Fujian University of Technology, Fuzhou, Fujian, China
Tien-Wen Sung
Fujian University of Technology, New Taipei, Taiwan
Kuo-Chi Chang

About this paper

Cite this paper

Said, S.A., Habashy, S.M., Salem, S.A., Saad, E.LS.M. (2023). A Straggler Identification Model for Large-Scale Distributed Computing Systems Using Machine Learning. In: Hassanien, A.E., Snášel, V., Tang, M., Sung, TW., Chang, KC. (eds) Proceedings of the 8th International Conference on Advanced Intelligent Systems and Informatics 2022. AISI 2022. Lecture Notes on Data Engineering and Communications Technologies, vol 152. Springer, Cham. https://doi.org/10.1007/978-3-031-20601-6_10

DOI: https://doi.org/10.1007/978-3-031-20601-6_10
Published: 18 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20600-9
Online ISBN: 978-3-031-20601-6
eBook Packages: Intelligent Technologies and Robotics (R0)
