word clustering

Word clustering for a word bi-gram model

1998

Shinsuke Mori Masafumi Nishimura Nobuyasu Itoh

In this paper we describe a word clustering method for a class-based n-gram model. The clustering criterion is the entropy on a corpus different from the corpus used for n-gram model estimation. The search method is based on a greedy algorithm. We applied this method to the Japanese EDR corpus and the English Penn Treebank corpus. The perplexities of word-based n-gram model on EDR corpus and Penn Tre...
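The idea in this abstract (score a clustering by entropy on a held-out corpus and search greedily) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names `heldout_entropy` and `greedy_cluster`, the add-alpha smoothing, the round-robin initial assignment, and the single-word move operator.

```python
import math
from collections import Counter

def heldout_entropy(train, heldout, cls, alpha=1.0):
    """Per-bigram cross-entropy (bits) of a class bigram model on held-out text.

    Model: P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i), with add-alpha
    smoothing (an illustrative choice, not the paper's estimator).
    """
    n_classes = len(set(cls.values()))
    class_bigrams = Counter((cls[a], cls[b]) for a, b in zip(train, train[1:]))
    class_context = Counter(cls[a] for a in train[:-1])
    word_counts = Counter(train)
    class_counts = Counter(cls[w] for w in train)
    class_size = Counter(cls.values())  # number of word types per class
    total = 0.0
    for a, b in zip(heldout, heldout[1:]):
        ca, cb = cls[a], cls[b]
        p_class = (class_bigrams[(ca, cb)] + alpha) / (class_context[ca] + alpha * n_classes)
        p_word = (word_counts[b] + alpha) / (class_counts[cb] + alpha * class_size[cb])
        total -= math.log2(p_class * p_word)
    return total / (len(heldout) - 1)

def greedy_cluster(train, heldout, n_classes):
    """Greedy search: move one word at a time, keep moves that lower held-out entropy."""
    vocab = sorted(set(train) | set(heldout))
    cls = {w: i % n_classes for i, w in enumerate(vocab)}  # arbitrary starting point
    best = heldout_entropy(train, heldout, cls)
    improved = True
    while improved:
        improved = False
        for w in vocab:
            for c in range(n_classes):
                if c == cls[w]:
                    continue
                old = cls[w]
                cls[w] = c
                h = heldout_entropy(train, heldout, cls)
                if h < best - 1e-12:
                    best, improved = h, True   # keep the move
                else:
                    cls[w] = old               # undo the move
    return cls, best

train = "the cat sat on the mat the dog sat on the rug".split()
heldout = "the dog sat on the mat".split()
classes, entropy = greedy_cluster(train, heldout, 2)
print(classes, round(entropy, 3))
```

Scoring on held-out text rather than on the training text means that moves which only fit the training counts are not rewarded. The sketch recomputes all counts for every candidate move, which is only practical for toy vocabularies.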


On Modeling Sense Relatedness in Multi-prototype Word Embedding

2017

Yixin Cao Jiaxin Shi Juan-Zi Li Zhiyuan Liu Chengjiang Li

To enhance the expressive power of distributional word representation learning models, many researchers induce word senses through clustering and learn multiple embedding vectors for each word, an approach known as the multi-prototype word embedding model. However, most related work ignores the relatedness among word senses, which actually plays an important role. In this paper, we propose a novel appro...


'Fighting' or 'Conflict'? An Approach to Revealing Concepts of Terms in Political Discourse

2017

Linyuan Tang Kyo Kageura

Previous work on the epistemology of fact-checking indicated the dilemma between the needs of binary answers for the public and ambiguity of political discussion. Determining concepts represented by terms in political discourse can be considered as a Word-Sense Disambiguation (WSD) task. The analysis of political discourse, however, requires identifying precise concepts of terms from relatively...


K-means and Graph-based Approaches for Chinese Word Sense Induction Task

2010

Lisha Wang Yanzhao Dou Xiaoling Sun Hongfei Lin

This paper details our experiments carried out for the Word Sense Induction task. For foreign languages (especially English), there have been many studies of word sense induction (WSI), and the approaches and techniques are increasingly mature. However, the study of Chinese WSI is just getting started, and there is not yet a good way to solve the problems encountered. WSI can be divided ...


Unsupervised Word Sense Induction using Distributional Statistics

2014

Kartik Goyal Eduard H. Hovy

Word sense induction is an unsupervised task to find and characterize different senses of polysemous words. This work investigates two unsupervised approaches that focus on using distributional word statistics to cluster the contextual information of the target words using two different algorithms involving latent Dirichlet allocation and spectral clustering. Using a large corpus for achieving ...


Scalable Object Discovery: A Hash-Based Approach to Clustering Co-occurring Visual Words

Journal: IEICE Transactions, 2011

Gibran Fuentes Pineda Hisashi Koga Toshinori Watanabe

We present a scalable approach to automatically discovering particular objects (as opposed to object categories) from a set of images. The basic idea is to search for local image features that consistently appear in the same images under the assumption that such co-occurring features underlie the same object. We first represent each image in the set as a set of visual words (vector quantized lo...


Latent semantic sentence clustering for multi-document summarization

2011

Johanna Geiß

This thesis investigates the applicability of Latent Semantic Analysis (LSA) to sentence clustering for Multi-Document Summarization (MDS). In contrast to more shallow approaches like measuring similarity of sentences by word overlap in a traditional vector space model, LSA takes word usage patterns into account. So far LSA has been successfully applied to different Information Retrieval (IR) t...


A Comparative Study of Word Co-occurrence for Term Clustering in Language Model-based Sentence Retrieval

2010

Saeedeh Momtazi Sanjeev Khudanpur Dietrich Klakow

Sentence retrieval is a very important part of question answering systems. Term clustering, in turn, is an effective approach for improving sentence retrieval performance: the more similar the terms in each cluster, the better the performance of the retrieval system. A key step in obtaining appropriate word clusters is accurate estimation of pairwise word similarities, based on their tendency t...
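As a rough illustration of estimating pairwise word similarity from co-occurrence, the sketch below computes pointwise mutual information (PMI) from sentence-level co-occurrence counts. PMI, the sentence-level window, and the function name `pmi_table` are assumptions chosen for illustration; the truncated abstract does not say which co-occurrence measures the study compares.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(sentences):
    """PMI for every word pair that co-occurs in at least one sentence.

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), where probabilities are
    fractions of sentences containing the word (or both words).
    """
    n = len(sentences)
    word_df = Counter()   # number of sentences containing each word
    pair_df = Counter()   # number of sentences containing both words
    for sentence in sentences:
        words = sorted(set(sentence))          # count each word once per sentence
        word_df.update(words)
        pair_df.update(combinations(words, 2)) # pairs are alphabetically ordered
    return {
        (a, b): math.log2(count * n / (word_df[a] * word_df[b]))
        for (a, b), count in pair_df.items()
    }

sentences = [["a", "b"], ["a", "b"], ["c", "d"], ["a", "c"]]
pmi = pmi_table(sentences)
print(pmi)
```

Pairs with positive PMI co-occur more often than chance would predict and are candidates for the same term cluster; pairs that never co-occur are simply absent from the table.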


Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model

2014

Xinying Song Xiaodong He Jianfeng Gao Li Deng

Deep neural network (DNN) based natural language processing models rely on a word embedding matrix to transform raw words into vectors. Recently, a deep structured semantic model (DSSM) has been proposed to project raw text to a continuously-valued vector for Web search. In this technical report, we propose learning word embeddings using DSSM. We show that the DSSM trained on a large body of text ...


Clustering and Diversifying Web Search Results with Graph-Based Word Sense Induction

Journal: Computational Linguistics, 2013

Antonio Di Marco Roberto Navigli

Web search result clustering aims to facilitate information search on the Web. Rather than the results of a query being presented as a flat list, they are grouped on the basis of their similarity and subsequently shown to the user as a list of clusters. Each cluster is intended to represent a different meaning of the input query, thus taking into account the lexical ambiguity (i.e., polysemy) i...
