iBet uBet web content aggregator. Adding the entire web to your favor.



Link to original content: https://api.crossref.org/works/10.2478/AMCS-2019-0005
{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,9,16]],"date-time":"2024-09-16T09:16:29Z","timestamp":1726478189759},"reference-count":27,"publisher":"University of Zielona G\u00f3ra, Poland","issue":"1","license":[{"start":{"date-parts":[[2019,3,1]],"date-time":"2019-03-01T00:00:00Z","timestamp":1551398400000},"content-version":"unspecified","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/3.0"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2019,3,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n <jats:p>Today\u2019s ETL tools provide capabilities to develop custom code as user-defined functions (UDFs) to extend the expressiveness of the standard ETL operators. However, while this allows us to easily add new functionalities, it also comes with the risk that the custom code is not intended to be optimized, e.g., by parallelism, and for this reason, it performs poorly for data-intensive ETL workflows. In this paper we present a novel framework, which allows the ETL developer to choose a design pattern in order to write parallelizable code and generates a configuration for the UDFs to be executed in a distributed environment. This enables ETL developers with minimum expertise in distributed and parallel computing to develop UDFs without taking care of parallelization configurations and complexities. We perform experiments on large-scale datasets based on TPC-DS and BigBench. 
The results show that our approach significantly reduces the effort of ETL developers and at the same time generates efficient parallel configurations to support complex and data-intensive ETL tasks.<\/jats:p>","DOI":"10.2478\/amcs-2019-0005","type":"journal-article","created":{"date-parts":[[2019,4,1]],"date-time":"2019-04-01T17:30:51Z","timestamp":1554139851000},"page":"69-79","source":"Crossref","is-referenced-by-count":4,"title":["Parallelizing user\u2013defined functions in the ETL workflow using orchestration style sheets"],"prefix":"10.61822","volume":"29","author":[{"given":"Syed Muhammad Fawad","family":"Ali","sequence":"first","affiliation":[{"name":"Faculty of Computing , Pozna\u0144 University of Technology , Piotrowo 2, 60-965 Pozna\u0144 , Poland"},{"name":"Data Engineering , trivago N.V. Leipzig, Bosestrasse 4, 04109 , Leipzig , Germany"}]},{"given":"Johannes","family":"Mey","sequence":"additional","affiliation":[{"name":"Faculty of Computer Science , Technical University of Dresden , Helmholtzstrasse 10, 01069 , Dresden , Germany"}]},{"given":"Maik","family":"Thiele","sequence":"additional","affiliation":[{"name":"Faculty of Computer Science , Technical University of Dresden , Helmholtzstrasse 10, 01069 , Dresden , Germany"}]}],"member":"37438","published-online":{"date-parts":[[2019,3,29]]},"reference":[{"key":"2023050302353040807_j_amcs-2019-0005_ref_001_w2aab3b7b4b1b6b1ab1ab1Aa","unstructured":"Ali, S.M.F. (2018). Next-generation ETL framework to address the challenges posed by big data, Workshop Proceedings of the EDBT\/ICDT Joint Conference, Vienna, Austria."},{"key":"2023050302353040807_j_amcs-2019-0005_ref_002_w2aab3b7b4b1b6b1ab1ab2Aa","doi-asserted-by":"crossref","unstructured":"Ali, S.M.F. and Wrembel, R. (2017). 
From conceptual design to performance optimization of ETL workflows: Current state of research and open problems, The VLDB Journal26(6): 1\u201325.10.1007\/s00778-017-0477-2","DOI":"10.1007\/s00778-017-0477-2"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_003_w2aab3b7b4b1b6b1ab1ab3Aa","doi-asserted-by":"crossref","unstructured":"A\u00dfmann, U. (2003). Invasive software composition, Invasive Software Composition, Springer, Berlin\/Heidelberg, pp. 107\u2013145.","DOI":"10.1007\/978-3-662-05082-8_4"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_004_w2aab3b7b4b1b6b1ab1ab4Aa","doi-asserted-by":"crossref","unstructured":"Battr\u00e9, D., Ewen, S., Hueske, F., Kao, O., Markl, V. and Warneke, D. (2010). Nephele\/PACTs: A programming model and execution framework for web-scale analytical processing, Proceedings of the Symposium on Cloud Computing, Indianapolis, IN, USA, pp. 119\u2013130.","DOI":"10.1145\/1807128.1807148"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_005_w2aab3b7b4b1b6b1ab1ab5Aa","doi-asserted-by":"crossref","unstructured":"Chaiken, R., Jenkins, B., Larson, P.-\u00c5., Ramsey, B., Shakib, D., Weaver, S. and Zhou, J. (2008). Scope: Easy and efficient parallel processing of massive data sets, Proceedings of the VLDB Endowment1(2): 1265\u20131276.10.14778\/1454159.1454166","DOI":"10.14778\/1454159.1454166"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_006_w2aab3b7b4b1b6b1ab1ab6Aa","unstructured":"Cloudera (2016). Example: Sentiment analysis using MapReduce custom counters, https:\/\/www.cloudera.com\/documentation\/other\/tutorial\/CDH5\/topics\/ht_example_4_sentiment_analysis.html."},{"key":"2023050302353040807_j_amcs-2019-0005_ref_007_w2aab3b7b4b1b6b1ab1ab7Aa","doi-asserted-by":"crossref","unstructured":"Dagum, L. and Menon, R. (1998). 
OpenMP: An industry standard API for shared-memory programming, IEEE Computational Science and Engineering5(1): 46\u201355.10.1109\/99.660313","DOI":"10.1109\/99.660313"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_008_w2aab3b7b4b1b6b1ab1ab8Aa","doi-asserted-by":"crossref","unstructured":"Dean, J. and Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters, Communications of the ACM51(1) 107\u2013113.10.1145\/1327452.1327492","DOI":"10.1145\/1327452.1327492"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_009_w2aab3b7b4b1b6b1ab1ab9Aa","doi-asserted-by":"crossref","unstructured":"Ekman, T. and Hedin, G. (2007). The JastAdd system modular extensible compiler construction, Science of Computer Programming69(1\u20133): 14\u201326.10.1016\/j.scico.2007.02.003","DOI":"10.1016\/j.scico.2007.02.003"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_010_w2aab3b7b4b1b6b1ab1ac10Aa","doi-asserted-by":"crossref","unstructured":"Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A. and Jacobsen, H.-A. (2013). Bigbench: Towards an industry standard benchmark for big data analytics, Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, pp. 1197\u20131208.","DOI":"10.1145\/2463676.2463712"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_011_w2aab3b7b4b1b6b1ab1ac11Aa","doi-asserted-by":"crossref","unstructured":"Gonz\u00e1lez-V\u00e9lez, H. and Kontagora, M. (2011). Performance evaluation of MapReduce using full virtualisation on a departmental cloud, International Journal of Applied Mathematics and Computer Science21(2): 275\u2013284, DOI: 10.2478\/v10006-011-0020-3.10.2478\/v10006-011-0020-3","DOI":"10.2478\/v10006-011-0020-3"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_012_w2aab3b7b4b1b6b1ab1ac12Aa","doi-asserted-by":"crossref","unstructured":"Gro\u00dfe, P., May, N. and Lehner, W. (2014). 
A study of partitioning and parallel UDF execution with the SAP HANA database, Proceedings of the 26th International Conference on Scientific and Statistical Database Management, Aalborg, Denmark, p. 36.","DOI":"10.1145\/2618243.2618274"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_013_w2aab3b7b4b1b6b1ab1ac13Aa","unstructured":"Hedin, G. (2000). Reference attributed grammars, Informatica (Slovenia)24(3): 301\u2013317."},{"key":"2023050302353040807_j_amcs-2019-0005_ref_014_w2aab3b7b4b1b6b1ab1ac14Aa","doi-asserted-by":"crossref","unstructured":"Karagiannis, A., Vassiliadis, P. and Simitsis, A. (2013). Scheduling strategies for efficient ETL execution, Information Systems38(6): 927\u2013945.10.1016\/j.is.2012.12.001","DOI":"10.1016\/j.is.2012.12.001"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_015_w2aab3b7b4b1b6b1ab1ac15Aa","unstructured":"Karol, S. (2015). Well-formed and Scalable Invasive Software Composition, PhD dissertation, Technische Universitat Dresden, Dresden."},{"key":"2023050302353040807_j_amcs-2019-0005_ref_016_w2aab3b7b4b1b6b1ab1ac16Aa","doi-asserted-by":"crossref","unstructured":"Kiczales, G., Lamping, J., Mendhekar, A., Maeda, C., Lopes, C., Loingtier, J.-M. and Irwin, J. (1997). Aspect-oriented programming, in M. Ak\u015fit and S. Matsuoka (Eds.), European Conference on Object-oriented Programming, Springer, Berlin\/Heidelberg, pp. 220\u2013242.10.1007\/BFb0053381","DOI":"10.1007\/BFb0053381"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_017_w2aab3b7b4b1b6b1ab1ac17Aa","doi-asserted-by":"crossref","unstructured":"Kumar, N. and Kumar, P.S. (2010). An efficient heuristic for logical optimization of ETL workflows, International Workshop on Business Intelligence for the Real-Time Enterprise, Singapore, Singapore, pp. 68\u201383.","DOI":"10.1007\/978-3-642-22970-1_6"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_018_w2aab3b7b4b1b6b1ab1ac18Aa","doi-asserted-by":"crossref","unstructured":"Liu, X., Thomsen, C. and Pedersen, T.B. (2013). 
ETLMR: A highly scalable dimensional etl framework based on MaprEduce, in A. Hameurlain et al. (Eds.), Transactions on Large-Scale Data-and Knowledge-Centered Systems VIII, Springer, Berlin\/Heidelberg, pp. 1\u201331.10.1007\/978-3-642-37574-3_1","DOI":"10.1007\/978-3-642-37574-3_1"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_019_w2aab3b7b4b1b6b1ab1ac19Aa","doi-asserted-by":"crossref","unstructured":"Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, pp. 55\u201360.","DOI":"10.3115\/v1\/P14-5010"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_020_w2aab3b7b4b1b6b1ab1ac20Aa","doi-asserted-by":"crossref","unstructured":"Mey, J., Karol, S., A\u00dfmann, U., Huismann, I., Stiller, J. and Fr\u00f6hlich, J. (2016). Using semantics-aware composition and weaving for multi-variant progressive parallelization, Procedia Computer Science80: 1554\u20131565.10.1016\/j.procs.2016.05.482","DOI":"10.1016\/j.procs.2016.05.482"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_021_w2aab3b7b4b1b6b1ab1ac21Aa","unstructured":"Nambiar, R.O. and Poess, M. (2006). The making of TPC-DS, Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, pp. 1049\u20131058."},{"key":"2023050302353040807_j_amcs-2019-0005_ref_022_w2aab3b7b4b1b6b1ab1ac22Aa","doi-asserted-by":"crossref","unstructured":"Simitsis, A., Vassiliadis, P. and Sellis, T. (2005). State-space optimization of ETL workflows, IEEE Transactions on Knowledge and Data Engineering17(10): 1404\u20131419.10.1109\/TKDE.2005.169","DOI":"10.1109\/TKDE.2005.169"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_023_w2aab3b7b4b1b6b1ab1ac23Aa","doi-asserted-by":"crossref","unstructured":"Simitsis, A., Wilkinson, K., Dayal, U. and Castellanos, M. (2010). 
Optimizing ETL workflows for fault-tolerance, IEEE 26th International Conference on Data Engineering (ICDE), Long Beach, CA, USA, pp. 385\u2013396.","DOI":"10.1109\/ICDE.2010.5447816"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_024_w2aab3b7b4b1b6b1ab1ac24Aa","doi-asserted-by":"crossref","unstructured":"Thomsen, C. and Pedersen, T.B. (2011). Easy and effective parallel programmable ETL, Proceedings of the ACM 14th International Workshop on Data Warehousing and OLAP, New York, NY, USA, pp. 37\u201344.","DOI":"10.1145\/2064676.2064684"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_025_w2aab3b7b4b1b6b1ab1ac25Aa","doi-asserted-by":"crossref","unstructured":"Tziovara, V., Vassiliadis, P. and Simitsis, A. (2007). Deciding the physical implementation of ETL workflows, Proceedings of the International Workshop on Data Warehousing and OLAP, New York, NY, USA, pp. 49\u201356.","DOI":"10.1145\/1317331.1317341"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_026_w2aab3b7b4b1b6b1ab1ac26Aa","doi-asserted-by":"crossref","unstructured":"Vassiliadis, P., Simitsis, A. and Baikousi, E. (2009). A taxonomy of ETL activities, Proceedings of the ACM 12th International Workshop on Data Warehousing and OLAP, New York, NY, USA, pp. 25\u201332.","DOI":"10.1145\/1651291.1651297"},{"key":"2023050302353040807_j_amcs-2019-0005_ref_027_w2aab3b7b4b1b6b1ab1ac27Aa","doi-asserted-by":"crossref","unstructured":"Weinberg, A.I. and Last, M. (2017). 
Interpretable decision-tree induction in a big data parallel framework, International Journal of Applied Mathematics and Computer Science27(4): 737\u2013748, DOI: 10.1515\/amcs-2017-0051.10.1515\/amcs-2017-0051","DOI":"10.1515\/amcs-2017-0051"}],"container-title":["International Journal of Applied Mathematics and Computer Science"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/content.sciendo.com\/view\/journals\/amcs\/29\/1\/article-p69.xml","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.sciendo.com\/pdf\/10.2478\/amcs-2019-0005","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,29]],"date-time":"2024-02-29T10:29:04Z","timestamp":1709202544000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.sciendo.com\/article\/10.2478\/amcs-2019-0005"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,3,1]]},"references-count":27,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2019,3,29]]},"published-print":{"date-parts":[[2019,3,1]]}},"alternative-id":["10.2478\/amcs-2019-0005"],"URL":"https:\/\/doi.org\/10.2478\/amcs-2019-0005","relation":{},"ISSN":["2083-8492"],"issn-type":[{"value":"2083-8492","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,3,1]]}}}