Link to original content: https://doi.org/10.1007/978-3-540-78246-9_71

New Issues in Near-duplicate Detection

  • Conference paper
Data Analysis, Machine Learning and Applications

Abstract

Near-duplicate detection is the task of identifying documents with almost identical content. The respective algorithms are based on fingerprinting; they have attracted considerable attention due to their practical significance for Web retrieval systems, plagiarism analysis, corporate storage maintenance, or social collaboration and interaction in the World Wide Web.
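As a rough illustration of the fingerprinting idea mentioned above, the following minimal sketch hashes overlapping word chunks (shingles) of each document and compares the resulting hash sets by their Jaccard overlap. It is an assumption-laden toy, not one of the algorithms evaluated in the paper: the function names, the chunk length `k=3`, and the use of a truncated SHA-1 hash are all illustrative choices.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word chunks (shingles) of a text."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than one chunk: treat the whole text as a single chunk.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(text, k=3):
    """Hypothetical fingerprint: the set of 64-bit hashes of all shingles."""
    return {
        int.from_bytes(
            hashlib.sha1(" ".join(chunk).encode("utf-8")).digest()[:8], "big"
        )
        for chunk in shingles(text, k)
    }


def resemblance(fp_a, fp_b):
    """Jaccard overlap of two fingerprints, a value between 0.0 and 1.0."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
score = resemblance(fingerprint(doc_a), fingerprint(doc_b))
print(score)  # a high score suggests the documents are near-duplicates
```

Practical systems typically keep only a small selection of the hashes per document so that fingerprints stay compact; this sketch keeps all of them for simplicity.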

Our paper presents both an integrative view and new aspects of the field of near-duplicate detection: (i) Principles and Taxonomy. Identification and discussion of the principles behind the known algorithms for near-duplicate detection. (ii) Corpus Linguistics. Presentation of a corpus that is specifically suited for the analysis and evaluation of near-duplicate detection algorithms. The corpus is public and may serve as a starting point for a standardized collection in this field. (iii) Analysis and Evaluation. Comparison of state-of-the-art algorithms for near-duplicate detection with respect to their retrieval properties. This analysis goes beyond existing surveys and includes recent developments from the field of hash-based search.

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

Cite this paper

Potthast, M., Stein, B. (2008). New Issues in Near-duplicate Detection. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_71
