Link to original content: https://doi.org/10.1145/2695664.2695757
Adaptive sorted neighborhood blocking for entity matching with MapReduce

Research article in Proceedings of the 30th Annual ACM Symposium on Applied Computing
DOI: 10.1145/2695664.2695757

Published: 13 April 2015

Abstract

Cloud computing has proven to be a powerful ally for the efficient parallel execution of data-intensive tasks such as Entity Matching (EM) in the era of Big Data. For this reason, studies of the challenges and possible solutions for making EM benefit from the cloud computing paradigm are in growing demand. In this paper, we investigate how the MapReduce programming model can be used to perform efficient parallel EM with a variation of the Sorted Neighborhood Method (SNM) that uses a window of varying (adaptive) size. We propose MapReduce Duplicate Count Strategy (MR-DCS++), an efficient MapReduce-based approach to the adaptive SNM that aims to further improve the performance of SNM. Evaluation results based on real-world datasets and cloud infrastructure show that our approach improves the performance of MapReduce-based SNM by reducing EM execution time.
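
To make the adaptive-window idea concrete, the following is a minimal single-machine sketch of a sorted neighborhood pass whose window grows while duplicates keep turning up. It is not the MR-DCS++ algorithm from the paper and involves no MapReduce; it only illustrates a duplicate-count-style rule under stated assumptions. The names `adaptive_snm`, `default_match`, `initial_window`, and `phi` are hypothetical, the string-similarity matcher is a stand-in for a real match function, and the window is extended by one record at a time, which is a simplification.

```python
from difflib import SequenceMatcher


def default_match(a, b, min_similarity=0.8):
    """Toy matcher (an assumption): string similarity at or above a cutoff."""
    return SequenceMatcher(None, str(a), str(b)).ratio() >= min_similarity


def adaptive_snm(records, key=lambda r: r, match=default_match,
                 initial_window=3, phi=0.5):
    """Sorted neighborhood pass with a growing window (illustrative only).

    Records are sorted by `key`. Each record is compared with its successors
    inside a window that starts at `initial_window` records (the record
    itself included). Whenever the window is exhausted and the ratio of
    detected duplicates to comparisons is at least `phi`, the window is
    extended by one record.

    Returns the sorted list and a set of index pairs (i, j), i < j, that
    refer to positions in that sorted list.
    """
    ordered = sorted(records, key=key)
    n = len(ordered)
    pairs = set()
    for i in range(n):
        window_end = min(i + initial_window, n)  # exclusive upper bound
        comparisons = 0
        duplicates = 0
        j = i + 1
        while j < window_end:
            comparisons += 1
            if match(ordered[i], ordered[j]):
                duplicates += 1
                pairs.add((i, j))
            j += 1
            # Window exhausted: grow it only if enough duplicates were found.
            if (j == window_end and window_end < n
                    and duplicates / comparisons >= phi):
                window_end += 1
    return ordered, pairs


if __name__ == "__main__":
    names = ["john smith", "mary jones", "jon smith", "john smyth", "zed alpha"]
    ordered, pairs = adaptive_snm(names, initial_window=2, phi=0.5)
    for i, j in sorted(pairs):
        print(ordered[i], "<->", ordered[j])
```

Setting `phi` above 1 prevents any extension, so the same function also behaves as a fixed-window SNM, which makes it easy to compare the two variants on the same input.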



        Published In

        SAC '15: Proceedings of the 30th Annual ACM Symposium on Applied Computing
        April 2015
        2418 pages
        ISBN:9781450331968
        DOI:10.1145/2695664

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. MapReduce
        2. adaptive
        3. entity matching
        4. sorted neighborhood method

        Qualifiers

        • Research-article

        Conference

        SAC 2015: Symposium on Applied Computing
        April 13 - 17, 2015
        Salamanca, Spain

        Acceptance Rates

        SAC '15 Paper Acceptance Rate 291 of 1,211 submissions, 24%;
        Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

        Cited By

        • (2021) The Four Generations of Entity Resolution. Synthesis Lectures on Data Management 16:2 (1-170). DOI: 10.2200/S01067ED1V01Y202012DTM064. Online publication date: 15-Mar-2021
        • (2019) A User Profile Analysis Framework Driven by Distributed Machine Learning for Big Data. Proceedings of the 2019 International Conference on Artificial Intelligence and Computer Science (358-363). DOI: 10.1145/3349341.3349431. Online publication date: 12-Jul-2019
        • (2019) Schema-Agnostic Progressive Entity Resolution. IEEE Transactions on Knowledge and Data Engineering 31:6 (1208-1221). DOI: 10.1109/TKDE.2018.2852763. Online publication date: 1-Jun-2019
        • (2019) Exploiting block co-occurrence to control block sizes for entity resolution. Knowledge and Information Systems. DOI: 10.1007/s10115-019-01347-0. Online publication date: 26-Mar-2019
        • (2018) Performance Analysis of Hadoop Cluster for User Behavior Analysis. 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS) (805-809). DOI: 10.1109/HPCC/SmartCity/DSS.2018.00135. Online publication date: Jun-2018
        • (2018) A Review of Unsupervised and Semi-supervised Blocking Methods for Record Linkage. Linking and Mining Heterogeneous and Multi-view Data (79-105). DOI: 10.1007/978-3-030-01872-6_4. Online publication date: 27-Nov-2018
        • (2017) Stream-based live entity resolution approach with adaptive duplicate count strategy. International Journal of Web and Grid Services 13:3 (351-373). DOI: 10.1504/IJWGS.2017.085167. Online publication date: 1-Jan-2017
        • (2016) Applying machine learning techniques for scaling out data quality algorithms in cloud computing environments. Applied Intelligence 45:2 (530-548). DOI: 10.1007/s10489-016-0774-2. Online publication date: 2-Apr-2016
        • (2016) Data Quality Monitoring of Cloud Databases Based on Data Quality SLAs. Big-Data Analytics and Cloud Computing (3-20). DOI: 10.1007/978-3-319-25313-8_1. Online publication date: 13-Jan-2016
