Abstract
MapReduce has demonstrated itself to be as a highly efficient programming model for processing massive dataset on the distributed system. One of the most important obstacles hindering the performance of MapReduce is data skewness. The presence of data skewness leads to considerable load imbalance on the reducers and performance degradation. In this paper, the problem of how to efficiently accommodate intermediate data to even up the load of all reducers is studied when encountering skewed data. A scalable sampling algorithm is used which it can observe a more precise approximate distribution of the keys by sampling only a small fraction of the intermediate data. Afterwards, it is applied to evaluate the overall distribution of the keys. In addition, we propose a sorted-balance algorithm based on sampling results: sorted-balance algorithm using scalable simple random sampling (SBaSC). This work not only puts forward a load-balanced partitioning strategy, but also proves a significant approximation ratio of SBaSC. The experiments confirm that our solution attains a better execution time and load balancing results.
Similar content being viewed by others
References
Akoka J, Comyn-Wattiau I, Laoufi N (2017) Research on big data—a systematic mapping study. Comput Stand Interfaces 54:105–115. https://doi.org/10.1016/j.csi.2017.01.004
Alharthi A, Krotov V, Bowman M (2017) Addressing barriers to big data. Bus Horizons 60(3):285–292. https://doi.org/10.1016/j.bushor.2017.01.002
Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya AY, Foufou S, Bouras A (2014) A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Top Comput 2(3):267–279. https://doi.org/10.1109/TETC.2014.2330519
Lee I (2017) Big data: dimensions, evolution, impacts, and challenges. Bus Horizons 60(3):293–303. https://doi.org/10.1016/j.bushor.2017.01.004
Big Data (2018) https://en.wikipedia.org/wiki/Big_data
Wu H (2017) Big data management the mass weather logs. In: Smart Computing and Communication, pp 122–132
Vaidya M (2012) Parallel processing of cluster by MapReduce. Int J Distrib Parallel Syst 3:167–179. https://doi.org/10.5121/ijdps.2012.3113
Xu Y, Qu W, Li Z, Liu Z, Ji C, Li Y, Li H (2014) Balancing reducer workload for skewed data using sampling-based partitioning. Comput Electr Eng 40(2):675–687. https://doi.org/10.1016/j.compeleceng.2013.07.001
Gufler B, Augsten N, Reiser A, Kemper A (2012) Load balancing in MapReduce based on scalable cardinality estimates. In: IEEE 28th International Conference on Data Engineering, pp 522–533. https://doi.org/10.1109/icde.2012.58
Meng X (2013) Scalable simple random sampling and stratified sampling. In: Proceedings of the 30th International Conference on International Conference on Machine Learning, Vol. 28, pp III-531–III-539
DeWitt DJ, Naughton JF, Schneider DA, Seshadri S (1992) Practical skew handling in parallel joins. In: Proceedings of the 18th International Conference on Very Large Data Bases, pp 27–40
Stamos JW, Young HC (1993) A symmetric fragment and replicate algorithm for distributed joins. IEEE Trans Parallel Distrib Syst 4(12):1345–1354. https://doi.org/10.1109/71.250116
Le Y, Liu J, Ergün F, Wang D (2014) Online load balancing for MapReduce with skewed data input. In: IEEE Conference on Computer Communications IEEE INFOCOM 2014, pp 2004–2012. https://doi.org/10.1109/infocom.2014.6848141
Karapiperis D, Verykios VS (2015) Load-balancing the distance computations in record linkage. SIGKDD Explor Newsl 17(1):1–7. https://doi.org/10.1145/2830544.2830546
Li J, Liu Y, Pan J, Zhang P, Chen W, Wang L (2017) Map-balance-reduce: an improved parallel programming model for load balancing of MapReduce. Future Gener Comput Syst. https://doi.org/10.1016/j.future.2017.03.013
Vu L, Alaghband G (2015) A load balancing parallel method for frequent pattern mining on multi-core cluster. In: Proceedings of the Symposium on High Performance Computing, pp 49–58
Kwon Y, Balazinska M, Howe B, Rolia J (2010) Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In: Proceedings of the 1st ACM Symposium on Cloud Computing, pp 75–86. https://doi.org/10.1145/1807128.1807140
Ramakrishnan SR, Swart G, Urmanov A (2012) Balancing reducer skew in MapReduce work-loads using progressive sampling. In: Proceedings of the Third ACM Symposium on Cloud Computing, pp 1–14. https://doi.org/10.1145/2391229.2391245
Gufler B, Augsten N, Reiser A, Kemper A (2011) Handling data skew in MapReduce. In: Proceedings of the 1st International Conference on Cloud Computing and Services Science, CLOSER 2011, pp 1–6
Ibrahim S, Jin H, Lu L, Wu S, He B, Qi L (2010) LEEN: locality/fairness-aware key partitioning for MapReduce in the Cloud. In: 2010 IEEE Second International Conference on Cloud Computing Technology and Science, pp 17–24. https://doi.org/10.1109/cloudcom.2010.25
Kwon Y, Balazinska M, Howe B, Rolia J (2012) SkewTune: mitigating skew in mapreduce applications. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp 25–36. https://doi.org/10.1145/2213836.2213840
Martha VS, Zhao W, Xu X (2013) h-MapReduce: a framework for workload balancing in MapReduce. In: 2013 IEEE 27th International Conference on Advanced Information Networking and Applications (AINA), pp 637–644. https://doi.org/10.1109/aina.2013.48
Chen Q, Yao J, Xiao Z (2015) LIBRA: lightweight data skew mitigation in MapReduce. IEEE Trans Parallel Distrib Syst 26(9):2520–2533. https://doi.org/10.1109/TPDS.20-14.2350972
Xu Y, Zou P, Qu W, Li Z, Li K, Cui X (2012) Sampling-based partitioning in MapReduce for skewed data. In: 2012 Seventh China Grid Annual Conference, pp 1–8. https://doi.org/10.1109/chinagrid.2012.18
Tang Z, Zhang X, Li K, Li K (2018) An intermediate data placement algorithm for load balancing in Spark computing environment. Future Gener Comput Syst 78:287–301. https://doi.org/10.1016/j.future.2016.06.027
Devore JL (2011) Probability and statistics for engineering and the sciences. Nelson Education, Scarborough
Estimating a Proportion for a small, finite population (2018) https://onlinecourses.science.psu.edu/stat414/node/264
Walpole REMRH, Myers SL, Ye K (2011) Probability statistics for engineers and scientists. Pearson Prentice Hall, Upper Saddle River
Vitter JS (1985) Random sampling with a reservoir. ACM Trans Math Softw 11(1):37–57. https://doi.org/10.1145/3147.3165
Sunter AB (1977) List sequential sampling with equal or unequal probabilities without placement. J R Stat Soc Ser C (Appl Stat) 26(3):261–268. https://doi.org/10.2307/2346966
Blum M, Floyd RW, Pratt V, Rivest RL, Tarjan RE (1973) Time bounds for selection. J Comput Syst Sci 7(4):448–461. https://doi.org/10.1016/S0022-0000(73)80033-9
Graham RL, Lawler EL, Lenstra JK, Kan AHGR (1979) Optimization and approximation in deterministic sequencing and scheduling: a survey. In: Hammer PL, Johnson EL, Korte BH (eds) Annals of discrete mathematics, vol 5. Elsevier, Amsterdam, pp 287–326. https://doi.org/10.1016/S0167-5060(08)70356-X
Graham R (1969) Bounds on multiprocessing timing anomalies. SIAM J Appl Math 17(2):416–429. https://doi.org/10.1137/0117039
Kleinberg J, Tardos É (2006) Algorithm design. Pearson/Addison-Wesley, Boston
Williamson DP, Shmoys DB (2011) The design of approximation algorithms. Cambridge University Press, Cambridge
Jimmy L (2009) The curse of Zipf and limits to parallelization: a look at the stragglers problem in MapReduce. In: Proceedings of LSDS-IR Workshop
Zipf GK (1949) Human behavior and the principle of least effort: an introduction to human ecology. Addison-Wesley Press, Boston
Apache Spark Examples (2017) https://spark.apache.org/examples.html
Range Partitioner (2017) https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
Altman DG, Bland JM (1996) Statistics notes: detecting skewness from summary information. BMJ 313(7066):1200
Khatami Z, Hong S, Lee J, Depner S, Chafi H, Ramanujam J, Kaiser H (2017) A load-balanced parallel and distributed sorting algorithm implemented with PGX.D. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp 1317–1324. https://doi.org/10.1109/IPDPSW.2017.30
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Gavagsaz, E., Rezaee, A. & Haj Seyyed Javadi, H. Load balancing in reducers for skewed data in MapReduce systems by using scalable simple random sampling. J Supercomput 74, 3415–3440 (2018). https://doi.org/10.1007/s11227-018-2391-9
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-018-2391-9