Abstract
Data generation has increased drastically over the past few years due to the rapid development of Internet-based technologies. This period has been called the big data era. Big data offer an emerging paradigm shift in data exploration and utilization. The MapReduce computational paradigm is a well-known framework and is considered the main enabler for the distributed and scalable processing of a large amount of data. However, despite recent efforts toward improving the performance of MapReduce, scheduling MapReduce jobs across multiple nodes has been considered a multi-objective optimization problem. This problem can become increasingly complex when virtualized clusters in cloud computing are used to execute a large number of tasks. This study aims to optimize MapReduce job scheduling based on the completion time and cost of cloud service models. First, the problem is formulated as a multi-objective model. The model consists of two objective functions, namely, (i) completion time and (ii) cost minimization. Second, a scheduling algorithm using earliest finish time scheduling that considers resource allocation and job scheduling in the cloud is proposed. Lastly, experimental results show that the proposed scheduler exhibits better performance than other well-known schedulers, such as FIFO and Fair.
Similar content being viewed by others
References
Abouzeid A, Bajda-Pawlikowski K, Abadi D, Silberschatz A, Rasin A (2009) HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc VLDB Endowment 2(1):922–933
Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A et al (2010) A view of cloud computing. Commun ACM 53(4):50–58
Bittencourt LF, Madeira ERM (2011) HCOC: a cost optimization algorithm for workflow scheduling in hybrid clouds. J Internet Serv Appl 2(3):207–227
Chang H, Kodialam M, Kompella RR, Lakshman T, Lee M, Mukherjee S (2011) Scheduling in mapreduce-like systems for fast completion time. Paper presented at the INFOCOM, 2011 Proceedings IEEE
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Dean J, Ghemawat S (2010) MapReduce: a flexible data processing tool. Commun ACM 53(1):72–77
Doulkeridis C, Nørvåg K (2014) A survey of large-scale analytical query processing in MapReduce. VLDB J 23(3):355–380
Durillo JJ, Prodan R (2014) Multi-objective workflow scheduling in amazon EC2. Clust Comput 17(2):169–189
Guo Z, Fox G, Zhou M, Ruan Y (2012) Improving resource utilization in mapreduce. Paper presented at the CLUSTER computing (CLUSTER), 2012 I.E. international conference on
Hadoop A (2009) Fair Scheduler https://hadoop.apache.org/docs/stable1/fair_scheduler.html
Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU (2015) The rise of “big data” on cloud computing: review and open research issues. Inf Syst 47:98–115
Heintz B, Chandra A, Sitaraman RK (2012) Optimizing mapreduce for highly distributed environments. arXiv preprint arXiv:1207.7055
Huang S, Huang J, Dai J, Xie T, Huang B (2011) The HiBench benchmark suite: characterization of the MapReduce-based data analysis. New Frontiers in Information and Software as Services,Springer, pp 209–228
Hussain H, Malik SUR, Hameed A, Khan SU, Bickler G, Min-Allah N et al (2013) A survey on resource allocation in high performance distributed computing systems. Parallel Comput 39(11):709–736
Ibrahim S, Jin H, Lu L, He B, Antoniu G, Wu S (2012) Maestro: replica-aware map scheduling for mapreduce. Paper presented at the cluster, cloud and grid computing (CCGrid), 2012 12th IEEE/ACM international symposium on
Isard M, Prabhakaran V, Currey J, Wieder U, Talwar K, Goldberg A (2009) Quincy: fair scheduling for distributed computing clusters. Paper presented at the Proceedings of the ACM SIGOPS 22nd symposium on operating systems principles
Jagadish H (2015) Big data and science: myths and reality. Big Data Res 2(2):49–52
Jiang D, Ooi BC, Shi L, Wu S (2010) The performance of mapreduce: an in-depth study. Proc VLDB Endowment 3(1–2):472–483
Kc K, Anyanwu K (2010) Scheduling hadoop jobs to meet deadlines. Paper presented at the cloud computing Technology and science (CloudCom), 2010 I.E. Second international conference on
Krish K, Anwar A, Butt AR (2014) [phi]Sched: a heterogeneity-aware Hadoop workflow scheduler. Paper presented at the Modelling, Analysis & Simulation of computer and telecommunication systems (MASCOTS), 2014 I.E. 22nd international symposium on
Laurila JK, Gatica-Perez D, Aad I, Blom J, Bornet O, Do T-M-T,. .. Miettinen M (2012) The mobile data challenge: big data for mobile computing research. Paper presented at the Proceedings of the Workshop on the Nokia Mobile Data Challenge, in Conjunction with the 10th International Conference on Pervasive Computing
Li J-J, Cui J, Wang D, Yan L, Huang Y-S (2011) Survey of MapReduce parallel programming model. Dianzi Xuebao (Acta Electron Sin) 39(11):2635–2642
Long S-Q, Zhao Y-L, Chen W (2014) MORM: a multi-objective optimized replication management strategy for cloud storage cluster. J Syst Archit 60(2):234–244
Lopes RV, & Menasce D (2016) A taxonomy of job scheduling on distributed computing systems. IEEE Transactions on Parallel and Distributed Systems 27(12):3412–3428
Medhane DV, Sangaiah AK (2017) Search space-based multi-objective optimization evolutionary algorithm. Comput Electr Eng 58:126–143
Mundkur P, Tuulos V, Flatow J (2011) Disco: a computing platform for large-scale data analytics. Paper presented at the Proceedings of the 10th ACM SIGPLAN workshop on Erlang
Nita M-C, Pop F, Voicu C, Dobre C, Xhafa F (2015) MOMTH: multi-objective scheduling algorithm of many tasks in Hadoop. Clust Comput 18:1–14
Philip Chen CL, Zhang C-Y (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci 275:314–347. doi:10.1016/j.ins.2014.01.015
Rasooli A, Down DG (2014) COSHH: a classification and optimization based scheduler for heterogeneous Hadoop systems. Futur Gener Comput Syst 36:1–15
Sakr S, Liu A, Fayoumi AG (2013) The family of MapReduce and large-scale data processing systems. ACM Comput Surv (CSUR) 46(1):11
Tiwari N, Sarkar S, Bellur U, Indrawan M (2015) Classification framework of MapReduce scheduling algorithms. ACM Comput Surv (CSUR) 47(3):49
Valvag SV, Johansen D (2008) Oivos: simple and efficient distributed data processing. Paper presented at the high performance computing and communications, 2008. HPCC'08. 10th IEEE international conference on
Wang Y, Shi W (2014) Budget-driven scheduling algorithms for batches of MapReduce jobs in heterogeneous clouds. Cloud Comput, IEEE Trans 2(3):306–319
Yoo D, Sim KM (2011) A comparative review of job scheduling for MapReduce. Paper presented at the cloud computing and intelligence systems (CCIS), 2011 I.E. international conference on
Zaharia M, Konwinski A, Joseph AD, Katz RH, Stoica I (2008) Improving MapReduce performance in heterogeneous environments. Paper presented at the OSDI
Zhang X, Zhong Z, Feng S, Tu B, Fan J (2011) Improving data locality of MapReduce by scheduling in homogeneous computing environments. Paper presented at the parallel and distributed processing with applications (ISPA), 2011 I.E. 9th international symposium on
Zhang W, Rajasekaran S, Wood T, Zhu M (2014) Mimp: Deadline and interference aware scheduling of hadoop virtual machines. Paper presented at the Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Acknowledgments
This paper is financially supported by by University Malaya Research Grant Programme (Equitable Society) under grant RP032B-16SBS.
Author information
Authors and Affiliations
Corresponding authors
Rights and permissions
About this article
Cite this article
Hashem, I.A.T., Anuar, N.B., Marjani, M. et al. Multi-objective scheduling of MapReduce jobs in big data processing. Multimed Tools Appl 77, 9979–9994 (2018). https://doi.org/10.1007/s11042-017-4685-y
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-017-4685-y