Abstract
Estimating the degree of semantic similarity or distance between concepts is a very common problem in research areas such as natural language processing, knowledge acquisition, information retrieval and data mining. Many similarity measures have been proposed in the past, exploiting either explicit knowledge, such as the structure of a taxonomy, or implicit knowledge, such as information distribution. In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter case, the frequencies of term appearances in a corpus are considered. Classical measures based on those premises suffer from some problems: in the first case, an excessive dependency on the taxonomical/ontological structure; in the second case, the lack of semantics of a purely statistical analysis of occurrences and/or the ambiguity of estimating a concept's statistical distribution from term appearances. Measures based on the Information Content (IC) of taxonomical concepts combine both approaches. However, to compute accurate concept appearance probabilities they depend heavily on a corpus that has been properly pre-tagged and disambiguated according to the ontological entities. This limits the applicability of those measures to other ontologies, such as specific domain ontologies, and to massive corpora, such as the Web. In this paper, several of these issues are analyzed, and modifications of classical similarity measures are proposed. They are based on a contextualized and scalable version of IC computation on the Web that exploits taxonomical knowledge. The goal is to remove the measures' dependency on corpus pre-processing while still achieving reliable results and minimizing language ambiguity. Our proposals are able to outperform classical approaches when the Web is used to estimate concept probabilities.
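To make the notion of IC-based similarity concrete, the following is a minimal sketch of the classical formulation, in which the IC of a concept is the negative logarithm of its appearance probability and similarity is derived from the IC of the least common subsumer. It is not the contextualized computation proposed in the paper. The toy taxonomy, the hit counts and the total page count are all hypothetical values chosen for illustration, and the function names are assumptions of this sketch.

```python
import math

# Hypothetical toy taxonomy (child -> parent); not taken from the paper.
PARENT = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "mammal",
    "mammal": None,
}

# Hypothetical hit counts standing in for search-engine results.
HITS = {"dog": 204e6, "wolf": 30e6, "canine": 250e6, "cat": 300e6, "mammal": 900e6}
TOTAL_PAGES = 1e10  # assumed size of the indexed corpus


def ancestors(concept):
    """Return the concept followed by its ancestors, most specific first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def ic(concept):
    """Information content: -log2 of the estimated appearance probability."""
    return -math.log2(HITS[concept] / TOTAL_PAGES)


def lcs(a, b):
    """Least common subsumer: the most specific concept subsuming both."""
    subsumers_of_b = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in subsumers_of_b:
            return candidate
    raise ValueError("concepts share no subsumer")


def resnik(a, b):
    """Resnik similarity: IC of the least common subsumer."""
    return ic(lcs(a, b))


def lin(a, b):
    """Lin similarity: shared IC normalized by the IC of both concepts."""
    return 2 * ic(lcs(a, b)) / (ic(a) + ic(b))


def jiang_conrath_distance(a, b):
    """Jiang and Conrath distance: IC not shared by the two concepts."""
    return ic(a) + ic(b) - 2 * ic(lcs(a, b))


if __name__ == "__main__":
    for pair in [("dog", "wolf"), ("dog", "cat")]:
        print(pair, lcs(*pair), round(resnik(*pair), 3), round(lin(*pair), 3))
```

With these made-up counts, dog and wolf score higher than dog and cat because their common subsumer is more specific and therefore less probable. Estimating probabilities from raw term counts in this way is exactly where the ambiguity problems described above arise.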
Notes
A synset in WordNet groups a set of synonyms and a gloss corresponding to a word sense (i.e. concept).
The word dog has 204 million occurrences, while canis has 2 million, as computed from Bing (Nov. 9th, 2008).
Bing search engine (http://www.bing.com).
Acknowledgements
This research has been partially supported by the Spanish Government within projects ARES (CONSOLIDER-INGENIO 2010 CSD2007-00004) and E-AEGIS (TSI2007-65406-C03-02). The work is partially supported by the Universitat Rovira i Virgili (2009AIRE-04) and the DAMASK project (Data mining algorithms with semantic knowledge, TIN2009-11005). Montserrat Batet is also supported by a research grant provided by the Universitat Rovira i Virgili.
Cite this article
Sánchez, D., Batet, M., Valls, A. et al. Ontology-driven web-based semantic similarity. J Intell Inf Syst 35, 383–413 (2010). https://doi.org/10.1007/s10844-009-0103-x
DOI: https://doi.org/10.1007/s10844-009-0103-x