Synonyms
Deduplication in Data Cleaning; Duplicate detection; Entity resolution; Instance identification; Merge-purge; Name matching; Record linkage
Definition
Record matching is the problem of identifying whether two records in a database refer to the same real-world entity. For example, in Fig. 1, the customer record A1 in Table A and record B1 in Table B probably refer to the same customer, and should therefore be matched. (The example in Fig. 1 was adapted from an example in [21].) As Fig. 1 suggests, the same entity can be encoded in different ways in a database; this phenomenon is fairly common and occurs due to a variety of natural reasons such as different formatting conventions, abbreviations, and typographic errors. Record matching is often studied in the following setting: Given two relations A and B, identify all pairs of matching records, one from each relation. For the two tables in Fig. 1, a reasonable output might be the pairs (A1, B1) and (A2, B2). In some settings of...
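The two-relation setting described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method from the entry or from the cited literature: the records are hypothetical (they are not the contents of Fig. 1), the `normalize`, `similarity`, and `match_records` helpers and the 0.8 threshold are assumptions made for the example, and the string similarity is the generic ratio from the standard library's `difflib`.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical customer records (name, address); NOT the data shown in Fig. 1.
table_a = {
    "A1": ("John Smith", "12 Main Street, Springfield"),
    "A2": ("Mary Jones", "400 Oak Avenue, Riverton"),
}
table_b = {
    "B1": ("Jon Smith", "12 Main St., Springfield"),
    "B2": ("Mary E. Jones", "400 Oak Ave, Riverton"),
}


def normalize(record):
    """Join the fields, lowercase them, and drop commas and periods."""
    text = " ".join(record).lower().replace(",", " ").replace(".", " ")
    return " ".join(text.split())


def similarity(rec_a, rec_b):
    """Character-level similarity in [0, 1] between two normalized records."""
    return SequenceMatcher(None, normalize(rec_a), normalize(rec_b)).ratio()


def match_records(a, b, threshold=0.8):
    """Compare every record of a with every record of b (a naive cross product)
    and keep the pairs whose similarity reaches the assumed threshold."""
    return [
        (key_a, key_b)
        for (key_a, rec_a), (key_b, rec_b) in product(a.items(), b.items())
        if similarity(rec_a, rec_b) >= threshold
    ]


matches = match_records(table_a, table_b)
print(matches)
```

With these hypothetical records the sketch reports the pairs (A1, B1) and (A2, B2). The cross product compares every record of one table with every record of the other, so it is practical only for small inputs.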
Recommended Reading
Arasu A, Chaudhuri S, Kaushik R. Transformation-based framework for record matching. In: Proceedings of the 24th International Conference on Data Engineering; 2008. p. 40–9.
Arasu A, Ganti V, Kaushik R. Efficient exact set-similarity joins. In: Proceedings of the 32nd International Conference on Very Large Data Bases; 2006. p. 918–29.
Bilenko M, Mooney RJ. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2004. p. 39–48.
Chaudhuri S, Chen BC, Ganti V, Kaushik R. Example-driven design of efficient record matching queries. In: Proceedings of the 33rd International Conference on Very Large Data Bases; 2007. p. 327–38.
Chaudhuri S, Ganjam K, Ganti V, Motwani R. Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2003. p. 313–24.
Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd International Conference on Data Engineering; 2006.
Cochinwala M, Kurien V, Lalk G, Shasha D. Efficient data reconciliation. Inf Sci. 2001;137(1–4):1–15.
Cohen WW. Data integration using similarity joins and a word-based information representation language. ACM Trans Inf Syst. 2000;18(3):288–321.
Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: a survey. IEEE Trans Knowl Data Eng. 2007;19(1):1–16.
Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc. 1969;64(328):1183–210.
Hernandez M, Stolfo S. The merge/purge problem for large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 1995. p. 127–38.
Jaro MA. Unimatch: a record linkage system: user’s manual. Technical Report. Washington, DC: US Bureau of the Census; 1976.
Jaro MA. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc. 1989;84(406):414–20.
Koudas N, Sarawagi S, Srivastava D. Record linkage: similarity measures and algorithms. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2006. p. 802–3.
McCallum A, Nigam K, Ungar LH. Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2000. p. 169–78.
Newcombe HB, Kennedy JM, Axford SJ, James AP. Automatic linkage of vital records. Science. 1959;130(3381):954–9.
Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2002. p. 269–78.
Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2004. p. 743–54.
Torra V, Domingo-Ferrer J. Record linkage methods for multidatabase data mining. In: Torra V, editor. Information fusion in data mining. Springer; 2003. p. 101–32.
Winkler W. Improved decision rules in the Fellegi-Sunter model of record linkage. Technical Report. Washington, DC: Statistical Research Division/US Bureau of the Census; 1993.
Winkler W. The state of record linkage and current research problems. Technical Report. Washington, DC: Statistical Research Division/US Bureau of the Census; 1999.
Copyright information
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Arasu, A., Domingo-Ferrer, J. (2018). Record Matching. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_594
DOI: https://doi.org/10.1007/978-1-4614-8265-9_594
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering