Automated Taxonomy Induction and its Applications

Gupta, Amit

doi:10.5075/epfl-thesis-8160

doctoral thesis

2017

Machine-readable semantic knowledge in the form of taxonomies (i.e., collections of is-a edges) has proved beneficial in an array of NLP tasks, including inference, textual entailment, question answering, and information extraction. This widespread utility has led to multiple large-scale manual efforts to construct taxonomies, such as WordNet and Cyc. However, manual construction of taxonomies is time-intensive and usually requires substantial annotation effort by domain experts. Furthermore, the resulting taxonomies suffer from low coverage and are unavailable for specific domains or languages. Therefore, in recent years, a growing body of work has aimed to induce taxonomies automatically, either from semi-structured knowledge resources or from unstructured text.

In this thesis, we focus on the task of automated taxonomy induction under a variety of settings. We first focus on taxonomy induction from the largest semi-structured knowledge resource, i.e., Wikipedia. More specifically, we introduce a set of novel heuristics for inducing a large-scale taxonomy from the English Wikipedia category network. We also propose a novel, comprehensive path-based evaluation framework for taxonomies. The taxonomy induced using our approach significantly outperforms the state of the art on both edge-based and path-based evaluation metrics. Moreover, our experiments demonstrate that high accuracy on edge-based evaluation metrics does not always translate to high accuracy on path-based evaluation metrics. Subsequently, we propose a novel approach that leverages the interlanguage links of Wikipedia to induce taxonomies in other Wikipedia languages. Compared to the state of the art, our approach is simpler, more principled, and results in taxonomies that are significantly more accurate on both edge-based and path-based evaluation metrics.
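The gap between edge-based and path-based accuracy can be illustrated with a toy example. The sketch below is not the evaluation framework proposed in the thesis; it assumes a simplified definition in which a leaf-to-root path counts as correct only if every is-a edge on it is correct. The function names, the toy taxonomy, and the gold edge set are all hypothetical.

```python
def edge_precision(edges, gold):
    """Fraction of induced is-a edges that appear in the gold edge set."""
    return sum(edge in gold for edge in edges) / len(edges)


def leaf_to_root_paths(parent):
    """List the edge sequence from each leaf up to its root.

    `parent` maps each term to its single hypernym; the taxonomy is
    assumed to be acyclic (a forest), otherwise the loop never ends.
    """
    hypernyms = set(parent.values())
    leaves = [term for term in parent if term not in hypernyms]
    paths = []
    for leaf in leaves:
        path, node = [], leaf
        while node in parent:
            path.append((node, parent[node]))
            node = parent[node]
        paths.append(path)
    return paths


def path_precision(parent, gold):
    """Fraction of leaf-to-root paths whose edges are all in the gold set
    (an assumed, simplified path-based metric)."""
    paths = leaf_to_root_paths(parent)
    return sum(all(edge in gold for edge in path) for path in paths) / len(paths)


# Hypothetical induced taxonomy with one wrong edge near the root: dog -> plant.
induced = {
    "poodle": "dog",
    "beagle": "dog",
    "terrier": "dog",
    "dog": "plant",
    "oak": "plant",
    "plant": "organism",
}

# Hypothetical gold is-a edges.
gold = {
    ("poodle", "dog"),
    ("beagle", "dog"),
    ("terrier", "dog"),
    ("dog", "animal"),
    ("animal", "organism"),
    ("oak", "plant"),
    ("plant", "organism"),
}

print(edge_precision(list(induced.items()), gold))  # 5 of 6 edges are correct
print(path_precision(induced, gold))                # 1 of 4 paths is correct
```

In this toy case a single incorrect edge high in the hierarchy leaves five of six edges correct, yet it invalidates three of the four leaf-to-root paths, which is the kind of divergence between the two metric families that the abstract reports.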

In the second part of this thesis, we focus on taxonomy induction from unstructured text. We propose a novel approach to taxonomy induction from an input vocabulary of seed terms. Unlike previous approaches, which extract single hypernym edges for terms, our approach uses a novel probabilistic framework to extract long-range hypernym subsequences. Taxonomy induction from the extracted subsequences is cast as an instance of the minimum-cost flow optimization problem on a carefully designed flow network. Through experiments, we demonstrate that our approach outperforms the state of the art across four languages. We also show that our approach is robust to noise in the input vocabulary. Our approach relaxes many simplifying assumptions employed by previous taxonomy induction approaches. As a result, our work serves to truly automate the process of taxonomy induction from unstructured text.

Finally, we introduce the task of discovering and generalizing lexicalized templates from the titles of Wikipedia entities. The experimental results on this task demonstrate that taxonomies that perform better on our proposed path-based metrics yield a more accurate set of generalizations for a given set of entities.

In summary, this thesis proposes new approaches to automated taxonomy induction that improve upon the state of the art in a variety of settings. It also relaxes many simplifying assumptions that limited the applicability of prior approaches.

Name

EPFL_TH8160.pdf

Access type

openaccess

Size

4.56 MB

Format

Adobe PDF

Checksum (MD5)

dc3a2925a6976aaffd6308b38e05a6a0