Cross-lingual Text Classification Using Topic-Dependent Word Probabilities
Authors
Abstract
Cross-lingual text classification is a major challenge in natural language processing, since training data is often available in only one language (the target language) but not in the language of the document to be classified (the source language). Here, we propose a method that requires only a bilingual dictionary to bridge the language gap. Our probabilistic model estimates translation probabilities that are conditioned on the whole source document. Its underlying assumption is that each document can be characterized by a distribution over topics, which helps to resolve the translation ambiguity of individual words. Using the derived translation probabilities, we then calculate the expected frequency of each word type in the target language. These expected word frequencies can finally be used to classify the source text with any classifier trained only on target-language documents. Our experiments confirm the usefulness of the proposed method.
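The expected-word-frequency step described in the abstract can be sketched as follows. This is an illustrative toy, not the authors' implementation: the bilingual dictionary, the vocabulary, and the `translation_prob` interface are all assumptions, and the uniform stand-in deliberately ignores the paper's topic conditioning.

```python
# Hypothetical sketch of turning source-language word counts into expected
# target-language word frequencies via a bilingual dictionary. The dictionary
# entries and probability model below are illustrative assumptions.
from collections import Counter, defaultdict

# Toy bilingual dictionary: source word -> candidate target translations.
DICTIONARY = {
    "banco": ["bank", "bench"],
    "dinero": ["money"],
}

def expected_target_frequencies(source_doc, translation_prob):
    """Compute expected target-language counts for a source document.

    translation_prob(src, tgt, doc) should return P(tgt | src, doc), i.e. a
    translation probability conditioned on the whole source document (in the
    paper, via the document's topic distribution).
    """
    counts = Counter(source_doc)
    expected = defaultdict(float)
    for src, freq in counts.items():
        for tgt in DICTIONARY.get(src, []):
            expected[tgt] += freq * translation_prob(src, tgt, source_doc)
    return dict(expected)

# Uniform probabilities as a stand-in for the topic-dependent model.
def uniform_prob(src, tgt, doc):
    return 1.0 / len(DICTIONARY[src])

doc = ["banco", "dinero", "banco"]
print(expected_target_frequencies(doc, uniform_prob))
# prints {'bank': 1.0, 'bench': 1.0, 'money': 1.0}
```

The resulting expected frequencies play the role of an ordinary bag-of-words vector, so any target-language classifier can consume them unchanged.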
Similar resources
Learning Cross-lingual Word Embeddings via Matrix Co-factorization
A joint-space model for cross-lingual distributed representations generalizes language-invariant semantic features. In this paper, we present a matrix cofactorization framework for learning cross-lingual word embeddings. We explicitly define monolingual training objectives in the form of matrix decomposition, and induce cross-lingual constraints for simultaneously factorizing monolingual matric...
Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications
Probabilistic topic models are unsupervised generative models which model document content as a two-step generation process, that is, documents are observed as mixtures of latent concepts or topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested into transferring the probabilistic topic modeling concept from monolingua...
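The two-step generation process mentioned in this overview can be illustrated with a small sketch. The topics, their word distributions, and the function below are assumed toy values, not material from the cited work.

```python
# Toy illustration (assumed, not from the overview) of the two-step generative
# process: a document draws words via its topic mixture, and each topic is a
# probability distribution over vocabulary words.
import random

TOPICS = {
    "finance": {"bank": 0.6, "money": 0.4},
    "parks":   {"bench": 0.7, "tree": 0.3},
}

def generate_document(topic_mixture, n_words, seed=0):
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        # Step 1: pick a topic according to the document's topic mixture.
        topic = rng.choices(list(topic_mixture),
                            weights=list(topic_mixture.values()))[0]
        # Step 2: pick a word from that topic's word distribution.
        word_dist = TOPICS[topic]
        words.append(rng.choices(list(word_dist),
                                 weights=list(word_dist.values()))[0])
    return words

print(generate_document({"finance": 0.5, "parks": 0.5}, 5))
```

Inference in a real topic model runs this process in reverse, recovering the topic mixture and word distributions from observed documents.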
Ontology-Supported Text Classification Based on Cross-Lingual Word Sense Disambiguation
The paper reports on recent experiments in cross-lingual document processing (with a case study for Bulgarian-English-Romanian language pairs) and brings evidence on the benefits of using linguistic ontologies for achieving, with a high level of accuracy, difficult tasks in NLP such as word alignment, word sense disambiguation, document classification, cross-language information retrieval, etc....
Document Representation with Statistical Word Senses in Cross-Lingual Document Clustering
Cross-lingual document clustering is the task of automatically organizing a large collection of multi-lingual documents into a few clusters, depending on their content or topic. It is well known that language barrier and translation ambiguity are two challenging issues for cross-lingual document representation. To this end, we propose to represent cross-lingual documents through statistical wor...
Structural Correspondence Learning for Cross-Lingual Sentiment Classification with One-to-Many Mappings
Structural correspondence learning (SCL) is an effective method for cross-lingual sentiment classification. This approach uses unlabeled documents along with a word translation oracle to automatically induce task specific, cross-lingual correspondences. It transfers knowledge through identifying important features, i.e., pivot features. For simplicity, however, it assumes that the word translat...
Publication date: 2015