Evaluating Similarity Measures for Dataset Search

Wang, Xu; Huang, Zhisheng; van Harmelen, Frank

doi:10.1007/978-3-030-62008-0_3

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12343)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

Dataset search engines help scientists find research datasets for scientific experiments. Current dataset search engines are query-driven, so their usefulness is limited by how well the user can specify a search query. An alternative is to adopt a recommendation paradigm (“if you like this dataset, you’ll also like...”). Such a recommendation service requires an appropriate similarity metric between datasets. Various similarity measures have been proposed in computational linguistics and information retrieval. The goal of this paper is to determine which similarity measure is suitable for a dataset search engine. We report experiments with different similarity measures over datasets, and we evaluate these measures against the gold standards developed for Elsevier DataSearch, a commercial dataset search engine. Using the F-measure and nDCG as evaluation measures, we find that Wu-Palmer Similarity, a similarity measure based on hierarchical terminologies, scores well in our benchmarks.
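To make the measures named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. The toy hierarchy, the term names, and the function names are all hypothetical. Depth is counted in nodes (root has depth 1), and nDCG uses the common log2(rank + 1) discount; both are assumptions, since the abstract does not state which variants the paper uses.

```python
import math

# Hypothetical toy hierarchy (child -> parent); not taken from the paper or MeSH.
PARENT = {
    "disease": None,
    "infection": "disease",
    "neoplasm": "disease",
    "viral infection": "infection",
    "bacterial infection": "infection",
}

def path_to_root(term):
    """Return [term, parent, ..., root]."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def depth(term):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(path_to_root(term))

def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first shared node is the deepest common ancestor.
    lcs = next(t for t in path_to_root(b) if t in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

def f_measure(retrieved, relevant):
    """Balanced F-measure (harmonic mean of precision and recall) over sets."""
    hits = len(set(retrieved) & set(relevant))
    if hits == 0:
        return 0.0
    precision = hits / len(set(retrieved))
    recall = hits / len(set(relevant))
    return 2 * precision * recall / (precision + recall)

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount (assumed variant)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

if __name__ == "__main__":
    print(wu_palmer("viral infection", "bacterial infection"))
    print(wu_palmer("viral infection", "neoplasm"))
    print(f_measure({"d1", "d2"}, {"d2", "d3"}))
    print(ndcg([1, 2, 3]))
```

In this sketch, two terms that share a deep common ancestor score higher than two terms that only meet at the root, which is the intuition behind using a hierarchical terminology to compare datasets.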

Notes

1. https://datasearch.elsevier.com/.
2. https://toolbox.google.com/datasetsearch.
3. https://data.mendeley.com/.
4. https://dumps.wikimedia.org/enwiki/.
5. http://www.nlm.nih.gov/mesh.
6. See also https://www.nlm.nih.gov/mesh/concept_structure.html.

Acknowledgements

This work has been funded by the Netherlands Science Foundation NWO, grant nr. 652.001.002; it is co-funded by Elsevier B.V., with funding for the first author from the China Scholarship Council (CSC), grant number 201807730060. We are grateful to our colleagues at Elsevier for sharing their dataset, and to all of our colleagues in the Data Search project for their valuable input.

Author information

Authors and Affiliations

Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Xu Wang, Zhisheng Huang & Frank van Harmelen

Corresponding author

Correspondence to Xu Wang.

Editor information

Editors and Affiliations

VU Amsterdam, Amsterdam, The Netherlands
Zhisheng Huang
VU Amsterdam, Amsterdam, The Netherlands
Wouter Beek
Victoria University, Melbourne, VIC, Australia
Hua Wang
Swinburne University of Technology, Hawthorn, VIC, Australia
Rui Zhou
Victoria University, Melbourne, VIC, Australia
Yanchun Zhang

Cite this paper

Wang, X., Huang, Z., van Harmelen, F. (2020). Evaluating Similarity Measures for Dataset Search. In: Huang, Z., Beek, W., Wang, H., Zhou, R., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2020. WISE 2020. Lecture Notes in Computer Science, vol 12343. Springer, Cham. https://doi.org/10.1007/978-3-030-62008-0_3

DOI: https://doi.org/10.1007/978-3-030-62008-0_3
Published: 21 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62007-3
Online ISBN: 978-3-030-62008-0
eBook Packages: Computer Science, Computer Science (R0)
