Projecting Corpus-Based Semantic Links on a Thesaurus
Abstract
Hypernym links acquired through an information extraction procedure are projected on multi-word terms through the recognition of semantic variations. The quality of the projected links resulting from corpus-based acquisition is compared with projected links extracted from a technical thesaurus.

1 Motivation

In the domain of corpus-based terminology, there are two main topics of research: term acquisition (the discovery of candidate terms) and automatic thesaurus construction (the addition of semantic links to a term bank).

Several studies have focused on automatic acquisition [...] The output of these tools is a list of unstructured multi-word terms. On the other hand, contributions to automatic construction of thesauri provide classes or links between single words. Classes are produced by clustering techniques based on similar word contexts (Schütze, 1993) or similar distributional contexts (Grefenstette, 1994). Links result from automatic acquisition of relevant predicative or discursive patterns [...] Predicative patterns yield predicative relations such as cause or effect, whereas discursive patterns yield non-predicative relations such as generic/specific or synonymy links.

The experiments presented in this paper were performed on [AGRO], a 1.3-million word French corpus of scientific abstracts in the agricultural domain. The termer used for multi-word term acquisition is ACABIT (Daille, 1996). It has produced 15,875 multi-word terms composed of 4,194 single words. For expository purposes, some examples are taken from [MEDIC], a 1.56-million word English corpus of scientific abstracts in the medical domain.

The main contribution of this article is to bridge the gap between term acquisition and thesaurus construction by offering a framework for organizing multi-word candidate terms with the help of automatically acquired links between single-word terms. Through the extraction of semantic variants, the semantic links between single words are projected on multi-word candidate terms.
As shown in Figure 1, the input to the system is a tagged corpus. A partial ontology between single-word terms and a set of multi-word candidate terms are produced after the first step. In a second step, layered hierarchies of multi-word terms are constructed through corpus-based conflation of semantic variants. Even though we focus here on generic/specific relations, the method would apply similarly to any other type of semantic relation. The study is organized as follows. First, the method for corpus-based acquisition of semantic links is presented. Then, the tool for semantic term normalization is described together with its application to semantic link projection. The last section analyzes the results on an agricultural corpus and evaluates the quality of the induced …
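The projection of single-word links onto multi-word terms described in the abstract can be illustrated with a minimal sketch. It assumes a deliberately simplified notion of semantic variant: two multi-word terms of equal length that differ in exactly one word, where the differing words are connected by a known hyponym/hypernym link. This is not the variant recognizer used in the paper; the function name `project_links` and the toy data are hypothetical and are not drawn from [AGRO] or [MEDIC].

```python
from itertools import permutations


def project_links(word_links, terms):
    """Project single-word hypernym links onto multi-word terms.

    word_links: set of (hyponym, hypernym) pairs between single words.
    terms: iterable of multi-word terms, each a tuple of lemmas.
    Returns a set of (specific_term, generic_term) pairs.
    """
    projected = set()
    # Consider every ordered pair of distinct terms.
    for specific, generic in permutations(set(terms), 2):
        if len(specific) != len(generic):
            continue
        # Collect the positions where the two terms use different words.
        diffs = [(s, g) for s, g in zip(specific, generic) if s != g]
        # Simplifying assumption: exactly one word differs, and that pair
        # of words is a known hyponym -> hypernym link.
        if len(diffs) == 1 and diffs[0] in word_links:
            projected.add((specific, generic))
    return projected


# Hypothetical toy data for illustration only.
word_links = {("apple", "fruit"), ("pear", "fruit")}
terms = [
    ("apple", "juice"),
    ("fruit", "juice"),
    ("pear", "juice"),
    ("fruit", "tree"),
]
links = project_links(word_links, terms)
```

On this toy input, the two single-word links are projected onto the term pairs that share the word "juice", while "fruit tree" stays unlinked because no other term differs from it by a single linked word. A fuller treatment would also have to handle morphological and syntactic variation between terms, which this sketch ignores.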
Similar resources
Automatic Acquisition and Expansion of Hypernym Links
Recent developments in computational terminology call for the design of multiple and complementary tools for the acquisition, the structuring and the exploitation of terminological data. This paper proposes to bridge the gap between term acquisition and thesaurus construction by offering a framework for automatic structuring of multi-word candidate terms with the help of corpus-based links betw...
Text Relatedness Based on a Word Thesaurus
The computation of relatedness between two fragments of text in an automated manner requires taking into account a wide range of factors pertaining to the meaning the two fragments convey, and the pairwise relations between their words. Without doubt, a measure of relatedness between text segments must take into account both the lexical and the semantic relatedness between words. Such a measure...
Corpus+WordNet thesaurus generation for ontology enriching
This paper presents a model to enrich an ontology with a thesaurus based on a domain corpus and WordNet. The model is applied to the data privacy domain and the initial domain resources comprise a data privacy ontology, a corpus of privacy laws, regulations and guidelines for projects. Based on these resources, a thesaurus is automatically generated. The thesaurus seeds are composed by the onto...
Automatic crosslingual thesaurus generated from the Hong Kong SAR Police Department Web corpus for crime analysis
based approach to align English/Chinese Hong Kong Police press release documents from the Web is first presented. We also introduce an algorithmic approach to generate a robust knowledge base based on statistical correlation analysis of the semantics (knowledge) embedded in the bilingual press release corpus. The research output consisted of a thesaurus-like, semantic network knowledge base, wh...
Lexical Semantic Relatedness with Random Graph Walks
Many systems for tasks such as question answering, multi-document summarization, and information retrieval need robust numerical measures of lexical relatedness. Standard thesaurus-based measures of word pair similarity are based on only a single path between those words in the thesaurus graph. By contrast, we propose a new model of lexical semantic relatedness that incorporates information fro...
Publication date: 1999