Abstract
Near-duplicate detection is the task of identifying documents with almost identical content. The algorithms for this task are based on fingerprinting; they have attracted considerable attention due to their practical significance for Web retrieval systems, plagiarism analysis, corporate storage maintenance, and social collaboration and interaction in the World Wide Web.
Our paper presents an integrative view of near-duplicate detection together with new aspects of the field: (i) Principles and Taxonomy. Identification and discussion of the principles behind the known algorithms for near-duplicate detection. (ii) Corpus Linguistics. Presentation of a corpus that is specifically suited for the analysis and evaluation of near-duplicate detection algorithms. The corpus is public and may serve as a starting point for a standardized collection in this field. (iii) Analysis and Evaluation. Comparison of state-of-the-art algorithms for near-duplicate detection with respect to their retrieval properties. This analysis goes beyond existing surveys and includes recent developments from the field of hash-based search.
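As a rough illustration of what "based on fingerprinting" can mean, the following minimal sketch hashes overlapping word windows (shingles) of a document and compares two documents by the overlap of the resulting hash sets. It is not taken from the paper: the function names, the window size k=3, the use of MD5, and the Jaccard coefficient as the similarity measure are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) of a text.

    The window size k=3 is an illustrative assumption, not a value from the paper.
    """
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than one window yield a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(text, k=3):
    """Map a text to a set of 64-bit integers, one per distinct shingle.

    MD5 is used only as a convenient deterministic hash (assumption).
    """
    return {
        int(hashlib.md5(s.encode("utf-8")).hexdigest()[:16], 16)
        for s in shingles(text, k)
    }


def jaccard(a, b):
    """Jaccard coefficient of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.7):
    """Flag two texts as near-duplicates if their fingerprints overlap strongly.

    The threshold 0.7 is a hypothetical value for this sketch.
    """
    return jaccard(fingerprint(text_a), fingerprint(text_b)) >= threshold
```

Two texts that differ only in their last word share most of their shingles and therefore receive a high similarity, whereas unrelated texts share none. Practical systems typically keep only a selected subset of the hashes to bound the fingerprint size; that selection step is omitted here for brevity.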
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Potthast, M., Stein, B. (2008). New Issues in Near-duplicate Detection. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_71
DOI: https://doi.org/10.1007/978-3-540-78246-9_71
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78239-1
Online ISBN: 978-3-540-78246-9
eBook Packages: Mathematics and Statistics (R0)