Labeling Unlabeled Data using Cross-Language Guided Clustering

نویسندگان

  • Sachindra Joshi
  • Danish Contractor
  • Sumit Negi
چکیده

The effort required to build a classifier for a task in a target language can be significantly reduced by utilizing the knowledge gained during an earlier effort of model building in a source language for a similar task. In this paper, we investigate whether unlabeled data in the target language can be labeled given the availability of labeled data for a similar domain in the source language. We view the problem of labeling unlabeled documents in the target language as that of clustering them such that the resulting partitioning has the best alignment with the classes provided in the source language. We develop a cross language guided clustering (CLGC) method to achieve this. We also propose a method to discover concept mapping between languages which is utilized by CLGC to transfer supervision across languages. Our experimental results show significant gains in the accuracy of labeling documents over the baseline methods.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

CSAL: Self-adaptive Labeling based Clustering Integrating Supervised Learning on Unlabeled Data

Supervised classification approaches can predict labels for unknown data because of the supervised training process. The success of classification is heavily dependent on the labeled training data. Differently, clustering is effective in revealing the aggregation property of unlabeled data, but the performance of most clustering methods is limited by the absence of labeled data. In real applica...

متن کامل

A Comparison of Discriminative EM-Based Semi-Supervised Learning algorithms on Agreement/Disagreement classification

Recently, semi-supervised learning has been an active research topic in the natural language processing community, to save effort in hand-labeling for data-driven learning and to exploit a large amount of readily available unlabeled text. In this paper, we apply EM-based semi-supervised learning algorithms such as traditional EM, co-EM, and cross validation EM to the task of agreement/disagreem...

متن کامل

Improved Automatic Clustering Using a Multi-Objective Evolutionary Algorithm With New Validity measure and application to Credit Scoring

In data mining, clustering is one of the important issues for separation and classification with groups like unsupervised data. In this paper, an attempt has been made to improve and optimize the application of clustering heuristic methods such as Genetic, PSO algorithm, Artificial bee colony algorithm, Harmony Search algorithm and Differential Evolution on the unlabeled data of an Iranian bank...

متن کامل

Using Language Modeling to Select Useful Annotation Data

An annotation project typically has an abundant supply of unlabeled data that can be drawn from some corpus, but because the labeling process is expensive, it is helpful to pre-screen the pool of the candidate instances based on some criterion of future usefulness. In many cases, that criterion is to improve the presence of the rare classes in the data to be annotated. We propose a novel method...

متن کامل

Exploiting Debate Portals for Semi-Supervised Argumentation Mining in User-Generated Web Discourse

Analyzing arguments in user-generated Web discourse has recently gained attention in argumentation mining, an evolving field of NLP. Current approaches, which employ fully-supervised machine learning, are usually domain dependent and suffer from the lack of large and diverse annotated corpora. However, annotating arguments in discourse is costly, errorprone, and highly context-dependent. We ask...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011