Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling

Gavagsaz, Elaheh; Rezaee, Ali; Haj Seyyed Javadi, Hamid

doi:10.1007/s11227-018-2391-9

Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling

Published: 30 April 2018

Volume 74, pages 3415–3440, (2018)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

Elaheh Gavagsaz¹,
Ali Rezaee¹ &
Hamid Haj Seyyed Javadi²

500 Accesses
12 Citations
Explore all metrics

Abstract

MapReduce has demonstrated itself to be as a highly efficient programming model for processing massive dataset on the distributed system. One of the most important obstacles hindering the performance of MapReduce is data skewness. The presence of data skewness leads to considerable load imbalance on the reducers and performance degradation. In this paper, the problem of how to efficiently accommodate intermediate data to even up the load of all reducers is studied when encountering skewed data. A scalable sampling algorithm is used which it can observe a more precise approximate distribution of the keys by sampling only a small fraction of the intermediate data. Afterwards, it is applied to evaluate the overall distribution of the keys. In addition, we propose a sorted-balance algorithm based on sampling results: sorted-balance algorithm using scalable simple random sampling (SBaSC). This work not only puts forward a load-balanced partitioning strategy, but also proves a significant approximation ratio of SBaSC. The experiments confirm that our solution attains a better execution time and load balancing results.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Comparative Investigation of Sample Versus Normal Map for Effective BigData Processing

Load balancing in join algorithms for skewed data in MapReduce systems

Article 01 September 2018

Research on Optimization of Data Balancing Partition Algorithm Based on Spark Platform

References

Akoka J, Comyn-Wattiau I, Laoufi N (2017) Research on big data—a systematic mapping study. Comput Stand Interfaces 54:105–115. https://doi.org/10.1016/j.csi.2017.01.004
Article Google Scholar
Alharthi A, Krotov V, Bowman M (2017) Addressing barriers to big data. Bus Horizons 60(3):285–292. https://doi.org/10.1016/j.bushor.2017.01.002
Article Google Scholar
Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya AY, Foufou S, Bouras A (2014) A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Top Comput 2(3):267–279. https://doi.org/10.1109/TETC.2014.2330519
Article Google Scholar
Lee I (2017) Big data: dimensions, evolution, impacts, and challenges. Bus Horizons 60(3):293–303. https://doi.org/10.1016/j.bushor.2017.01.004
Article Google Scholar
Big Data (2018) https://en.wikipedia.org/wiki/Big_data
Wu H (2017) Big data management the mass weather logs. In: Smart Computing and Communication, pp 122–132
Vaidya M (2012) Parallel processing of cluster by MapReduce. Int J Distrib Parallel Syst 3:167–179. https://doi.org/10.5121/ijdps.2012.3113
Article Google Scholar
Xu Y, Qu W, Li Z, Liu Z, Ji C, Li Y, Li H (2014) Balancing reducer workload for skewed data using sampling-based partitioning. Comput Electr Eng 40(2):675–687. https://doi.org/10.1016/j.compeleceng.2013.07.001
Article Google Scholar
Gufler B, Augsten N, Reiser A, Kemper A (2012) Load balancing in MapReduce based on scalable cardinality estimates. In: IEEE 28th International Conference on Data Engineering, pp 522–533. https://doi.org/10.1109/icde.2012.58
Meng X (2013) Scalable simple random sampling and stratified sampling. In: Proceedings of the 30th International Conference on International Conference on Machine Learning, Vol. 28, pp III-531–III-539
DeWitt DJ, Naughton JF, Schneider DA, Seshadri S (1992) Practical skew handling in parallel joins. In: Proceedings of the 18th International Conference on Very Large Data Bases, pp 27–40
Stamos JW, Young HC (1993) A symmetric fragment and replicate algorithm for distributed joins. IEEE Trans Parallel Distrib Syst 4(12):1345–1354. https://doi.org/10.1109/71.250116
Article Google Scholar
Le Y, Liu J, Ergün F, Wang D (2014) Online load balancing for MapReduce with skewed data input. In: IEEE Conference on Computer Communications IEEE INFOCOM 2014, pp 2004–2012. https://doi.org/10.1109/infocom.2014.6848141
Karapiperis D, Verykios VS (2015) Load-balancing the distance computations in record linkage. SIGKDD Explor Newsl 17(1):1–7. https://doi.org/10.1145/2830544.2830546
Article Google Scholar
Li J, Liu Y, Pan J, Zhang P, Chen W, Wang L (2017) Map-balance-reduce: an improved parallel programming model for load balancing of MapReduce. Future Gener Comput Syst. https://doi.org/10.1016/j.future.2017.03.013
Google Scholar
Vu L, Alaghband G (2015) A load balancing parallel method for frequent pattern mining on multi-core cluster. In: Proceedings of the Symposium on High Performance Computing, pp 49–58
Kwon Y, Balazinska M, Howe B, Rolia J (2010) Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In: Proceedings of the 1st ACM Symposium on Cloud Computing, pp 75–86. https://doi.org/10.1145/1807128.1807140
Ramakrishnan SR, Swart G, Urmanov A (2012) Balancing reducer skew in MapReduce work-loads using progressive sampling. In: Proceedings of the Third ACM Symposium on Cloud Computing, pp 1–14. https://doi.org/10.1145/2391229.2391245
Gufler B, Augsten N, Reiser A, Kemper A (2011) Handling data skew in MapReduce. In: Proceedings of the 1st International Conference on Cloud Computing and Services Science, CLOSER 2011, pp 1–6
Ibrahim S, Jin H, Lu L, Wu S, He B, Qi L (2010) LEEN: locality/fairness-aware key partitioning for MapReduce in the Cloud. In: 2010 IEEE Second International Conference on Cloud Computing Technology and Science, pp 17–24. https://doi.org/10.1109/cloudcom.2010.25
Kwon Y, Balazinska M, Howe B, Rolia J (2012) SkewTune: mitigating skew in mapreduce applications. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp 25–36. https://doi.org/10.1145/2213836.2213840
Martha VS, Zhao W, Xu X (2013) h-MapReduce: a framework for workload balancing in MapReduce. In: 2013 IEEE 27th International Conference on Advanced Information Networking and Applications (AINA), pp 637–644. https://doi.org/10.1109/aina.2013.48
Chen Q, Yao J, Xiao Z (2015) LIBRA: lightweight data skew mitigation in MapReduce. IEEE Trans Parallel Distrib Syst 26(9):2520–2533. https://doi.org/10.1109/TPDS.20-14.2350972
Article Google Scholar
Xu Y, Zou P, Qu W, Li Z, Li K, Cui X (2012) Sampling-based partitioning in MapReduce for skewed data. In: 2012 Seventh China Grid Annual Conference, pp 1–8. https://doi.org/10.1109/chinagrid.2012.18
Tang Z, Zhang X, Li K, Li K (2018) An intermediate data placement algorithm for load balancing in Spark computing environment. Future Gener Comput Syst 78:287–301. https://doi.org/10.1016/j.future.2016.06.027
Article Google Scholar
Devore JL (2011) Probability and statistics for engineering and the sciences. Nelson Education, Scarborough
Google Scholar
Estimating a Proportion for a small, finite population (2018) https://onlinecourses.science.psu.edu/stat414/node/264
Walpole REMRH, Myers SL, Ye K (2011) Probability statistics for engineers and scientists. Pearson Prentice Hall, Upper Saddle River
MATH Google Scholar
Vitter JS (1985) Random sampling with a reservoir. ACM Trans Math Softw 11(1):37–57. https://doi.org/10.1145/3147.3165
Article MathSciNet MATH Google Scholar
Sunter AB (1977) List sequential sampling with equal or unequal probabilities without placement. J R Stat Soc Ser C (Appl Stat) 26(3):261–268. https://doi.org/10.2307/2346966
MathSciNet Google Scholar
Blum M, Floyd RW, Pratt V, Rivest RL, Tarjan RE (1973) Time bounds for selection. J Comput Syst Sci 7(4):448–461. https://doi.org/10.1016/S0022-0000(73)80033-9
Article MathSciNet MATH Google Scholar
Graham RL, Lawler EL, Lenstra JK, Kan AHGR (1979) Optimization and approximation in deterministic sequencing and scheduling: a survey. In: Hammer PL, Johnson EL, Korte BH (eds) Annals of discrete mathematics, vol 5. Elsevier, Amsterdam, pp 287–326. https://doi.org/10.1016/S0167-5060(08)70356-X
Google Scholar
Graham R (1969) Bounds on multiprocessing timing anomalies. SIAM J Appl Math 17(2):416–429. https://doi.org/10.1137/0117039
Article MathSciNet MATH Google Scholar
Kleinberg J, Tardos É (2006) Algorithm design. Pearson/Addison-Wesley, Boston
Google Scholar
Williamson DP, Shmoys DB (2011) The design of approximation algorithms. Cambridge University Press, Cambridge
Book MATH Google Scholar
Jimmy L (2009) The curse of Zipf and limits to parallelization: a look at the stragglers problem in MapReduce. In: Proceedings of LSDS-IR Workshop
Zipf GK (1949) Human behavior and the principle of least effort: an introduction to human ecology. Addison-Wesley Press, Boston
Google Scholar
Apache Spark Examples (2017) https://spark.apache.org/examples.html
Range Partitioner (2017) https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
Altman DG, Bland JM (1996) Statistics notes: detecting skewness from summary information. BMJ 313(7066):1200
Article Google Scholar
Khatami Z, Hong S, Lee J, Depner S, Chafi H, Ramanujam J, Kaiser H (2017) A load-balanced parallel and distributed sorting algorithm implemented with PGX.D. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp 1317–1324. https://doi.org/10.1109/IPDPSW.2017.30

Download references

Author information

Authors and Affiliations

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Elaheh Gavagsaz & Ali Rezaee
Department of Applied Mathematics, Faculty of Mathematics and Computer Science, Shahed University, Tehran, Iran
Hamid Haj Seyyed Javadi

Authors

Elaheh Gavagsaz
View author publications
You can also search for this author in PubMed Google Scholar
Ali Rezaee
View author publications
You can also search for this author in PubMed Google Scholar
Hamid Haj Seyyed Javadi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ali Rezaee.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Gavagsaz, E., Rezaee, A. & Haj Seyyed Javadi, H. Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling. J Supercomput 74, 3415–3440 (2018). https://doi.org/10.1007/s11227-018-2391-9

Download citation

Published: 30 April 2018
Issue Date: July 2018
DOI: https://doi.org/10.1007/s11227-018-2391-9

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

A Comparative Investigation of Sample Versus Normal Map for Effective BigData Processing

Load balancing in join algorithms for skewed data in MapReduce systems

Research on Optimization of Data Balancing Partition Algorithm Based on Spark Platform

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

A Comparative Investigation of Sample Versus Normal Map for Effective BigData Processing

Load balancing in join algorithms for skewed data in MapReduce systems

Research on Optimization of Data Balancing Partition Algorithm Based on Spark Platform

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation