Corpus-Based Learning Of Compound Noun Indexing

Authors

  • Byung-Kwan Kwak
  • Jee-Hyub Kim
  • Gary Geunbae Lee
  • Jungyun Seo
Abstract

In this paper, we present a corpus-based learning method that can index diverse types of compound nouns using rules automatically extracted from a large tagged corpus. We develop an efficient way of extracting the compound noun indexing rules automatically and perform extensive experiments to evaluate our indexing rules. The automatic learning method performs about as well as the manual linguistic approach but is more portable and requires no human effort. We also evaluate seven different filtering methods on both effectiveness and efficiency, and present a new method to solve the problems of compound noun over-generation and data sparseness in statistical compound noun processing.

Similar resources

Korean Compound Noun Indexing Based on Lexical Association and Conceptual Association

Conventional methods have dealt with compound nouns with the goal of enhancing recall. That is, they have extracted only the unit nouns within a full compound noun but not the head modifiers, which can improve precision. This paper presents a new method of Korean compound noun indexing that can improve both recall and precision. Our method extracts head modifiers from a compound...

Compound Noun Segmentation Based on Lexical Data Extracted from Corpus

Compound noun analysis is one of the crucial problems in Korean language processing because a series of nouns in Korean may appear without white space in real texts, which makes it difficult to identify the morphological constituents. This paper presents an effective method of Korean compound noun segmentation based on lexical data extracted from a corpus. The segmentation is done in two steps: ...

Automatic Identification of Bengali Noun-Noun Compounds Using Random Forest

This paper presents a supervised machine learning approach that uses the Random Forest algorithm to recognize Bengali noun-noun compounds as multiword expressions (MWEs) in a Bengali corpus. Our proposed approach to MWE recognition has two steps: (1) extraction of candidate multiword expressions using chunk information and various heuristic rules and (2) training the ...

Distributional Semantics Approach to Thai Word Sense Disambiguation

Word sense disambiguation is one of the most important open problems in natural language processing applications such as information retrieval and machine translation. Many strategies can be employed to resolve word ambiguity with a reasonable degree of accuracy. These strategies are knowledge-based, corpus-based, and hybrid. This paper pays attention to the corpus-based strategy...

Identification of Noun-Noun (N-N) Collocations as Multi-Word Expressions in Bengali Corpus

Noun-noun compounds, a subset of compound nouns as well as nominal compounds, play an important role in NLP applications such as machine translation and information retrieval because of their token frequency, type frequency, and occurrence in the world's languages. Recognition of MWEs requires deep or shallow syntactic preprocessing tools and large corpora. The problem is quite difficult in Beng...

Publication date: 2000