word clustering

Improving Sentence Similarity Measurement by Incorporating Sentential Word Importance

2010

Andrew Skabar, Khaled Abdalgader

Measuring similarity between sentences plays an important role in textual applications such as document summarization and question answering. While various sentence similarity measures have recently been proposed, these measures typically only take into account word importance by virtue of inverse document frequency (IDF) weighting. IDF values are based on global information compiled over a lar...
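To make the IDF baseline mentioned in this abstract concrete, here is a minimal sketch of an IDF-weighted cosine similarity between two tokenized sentences. It is a generic textbook formulation, not the measure proposed by the authors; the function names and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter

def idf_table(corpus):
    """Map each word to log(N / df), where df is its document frequency."""
    n_docs = len(corpus)
    df = Counter(word for doc in corpus for word in set(doc))
    return {word: math.log(n_docs / count) for word, count in df.items()}

def idf_cosine(sent_a, sent_b, idf):
    """Cosine similarity between two token lists using TF-IDF weights."""
    vec_a = {w: tf * idf.get(w, 0.0) for w, tf in Counter(sent_a).items()}
    vec_b = {w: tf * idf.get(w, 0.0) for w, tf in Counter(sent_b).items()}
    dot = sum(weight * vec_b.get(w, 0.0) for w, weight in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no informative words in at least one sentence
    return dot / (norm_a * norm_b)

# Hypothetical toy corpus; each "document" is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
idf = idf_table(corpus)
```

Because "the" occurs in every toy document, its IDF weight is zero regardless of the role it plays in a particular sentence, which shows how such weights depend only on corpus-wide statistics.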


Arabic-English Semantic Word Class Alignment to Improve Statistical Machine Translation

2015

Ines Turki Khemakhem, Salma Jamoussi, Abdelmajid Ben Hamadou

Clustering words is a widely used technique in statistical natural language processing. It requires syntactic, semantic, and contextual features. Semantic clustering in particular is gaining a lot of interest. It consists of grouping a set of words expressing the same idea or sharing the same semantic properties. In this paper, we present a new method to integrate semantic classes in a Statistica...


Document Clustering Method Based on Frequent Co-occurring Words

2006

Yehang Zhu, Guanzhong Dai, Benjamin C. M. Fung, Dejun Mu

This paper presents a new document clustering method based on frequent co-occurring words. We first employ the Singular Value Decomposition, and then group the words into clusters called word representatives, which substitute for the corresponding words in the original documents. Next, we extract the frequent word representative sets by Apriori. Subsequently, each document is designated to a basic...
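The Apriori step named in this abstract can be sketched as a level-wise frequent-itemset search. The code below is a minimal, generic Apriori over documents represented as sets of items; the identifiers `C1`, `C2`, `C3` stand in for hypothetical word-representative cluster ids, and nothing here reproduces the paper's SVD stage or its document assignment step.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support_count} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    current = count({frozenset([i]) for i in items})
    frequent = dict(current)
    k = 2
    while current:
        prev = list(current)
        # Join step: union pairs of frequent (k-1)-sets that form a k-set.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical documents, each reduced to a set of word-representative ids.
docs = [
    {"C1", "C2", "C3"},
    {"C1", "C2"},
    {"C1", "C3"},
    {"C2", "C3"},
    {"C1", "C2", "C3"},
]
result = apriori(docs, min_support=3)
```

With a minimum support of 3, every single id and every pair is frequent in this toy data, while the full triple appears in only two documents and is dropped.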


Unsupervised Dependency Parsing without Gold Part-of-Speech Tags

2011

Valentin I. Spitkovsky, Hiyan Alshawi, Angel X. Chang, Daniel Jurafsky

We show that categories induced by unsupervised word clustering can surpass the performance of gold part-of-speech tags in dependency grammar induction. Unlike classic clustering algorithms, our method allows a word to have different tags in different contexts. In an ablative analysis, we first demonstrate that this context-dependence is crucial to the superior performance of gold tags — requir...


Context-Dependent Conflation, Text Filtering and Clustering

2004

Ronald N. Kostoff, Joel Block

The presence of trivial words in text databases can impact record or concept (words/phrases) clustering adversely. Additionally, the determination of whether a word/phrase is trivial is context-dependent. The objective of the present paper is to demonstrate a context-dependent trivial word filter to improve clustering quality. Factor analysis was used as a context-dependent trivial word filte...


Incorporating Word Clustering into Complex Noun Phrase Identification

2015

Lihua Xue, Guiping Zhang, Qiaoli Zhou, Na Ye

Since professional technical literature includes large numbers of complex noun phrases, identifying those phrases has important practical value for tasks such as machine translation. Through analysis of those phrases in Chinese-English bilingual sentence pairs from aircraft technical publications, we present an annotation specification based on the existing specification to label those phra...


A Model for Word Clustering

Journal: JASIS, 1992

James A. Thom, Justin Zobel

It is common to model the distribution of words in text by measures such as the Poisson approximation. However, these measures ignore effects such as clustering: our analysis of document collections demonstrates that the Poisson approximation can significantly overestimate the probability that a document contains a word. Based on our analysis, we propose a new model for distribution of words in...
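The overestimation described in this abstract can be illustrated with a small calculation. Under a Poisson approximation with rate λ equal to the mean number of occurrences per document, the probability that a document contains the word is 1 − e^(−λ). The sketch below compares that prediction with the observed fraction in a toy collection where occurrences are clustered in a few documents. The counts are invented for illustration, documents are assumed to be of equal length, and the authors' proposed model is not reproduced here.

```python
import math

def poisson_containment(total_occurrences, n_docs):
    """P(document contains the word) = 1 - exp(-lambda), lambda = mean count per doc."""
    lam = total_occurrences / n_docs
    return 1.0 - math.exp(-lam)

def observed_containment(counts_per_doc):
    """Fraction of documents in which the word occurs at least once."""
    return sum(1 for c in counts_per_doc if c > 0) / len(counts_per_doc)

# Hypothetical per-document counts of one word: 10 occurrences, clustered in 3 of 10 docs.
counts = [5, 0, 0, 4, 0, 0, 0, 1, 0, 0]
predicted = poisson_containment(sum(counts), len(counts))
observed = observed_containment(counts)
```

Here λ = 1, so the Poisson estimate is 1 − e^(−1), roughly 0.63, while only 0.3 of the toy documents actually contain the word.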


Layered Dialog and Word Clustering

2010

Tynan Smith

The Restaurant Game is part of a project to develop an AI system that can play a video game with a human or another AI just by using annotated recordings of humans playing the game as examples. The Restaurant Game is a simple two-player restaurant simulation in which characters are instructed to act out a typical interaction between a customer and a waitress. We have collected about 10,000 recor...


Clustering Paraphrases by Word Sense

2016

Anne Cocos, Chris Callison-Burch

Automatically generated databases of English paraphrases have the drawback that they return a single list of paraphrases for an input word or phrase. This means that all senses of polysemous words are grouped together, unlike WordNet which partitions different senses into separate synsets. We present a new method for clustering paraphrases by word sense, and apply it to the Paraphrase Database ...


Distributional Word Clustering in Parallel

2006

Alan L. Ritter, James W. Hearne, Philip A. Nelson

We discuss various methods which have been applied to grouping words into syntactic and semantic categories, primarily how they deal with the problems of sparsity and computational complexity. We then present a method of distributional clustering, and discuss the parallelization of the most computationally intensive part of this process.
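As a rough illustration of the distributional idea behind this abstract, the sketch below represents each word by counts of its immediate left and right neighbours and compares words by cosine similarity. This is one common formulation chosen as an assumption; the abstract does not specify the features, the similarity measure, or the parallelization scheme, and none of those are reproduced here.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences):
    """Count immediate left/right neighbours for every word."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for i in range(1, len(padded) - 1):
            vectors[padded[i]][("L", padded[i - 1])] += 1
            vectors[padded[i]][("R", padded[i + 1])] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(n * v[k] for k, n in u.items())  # Counter yields 0 for missing keys
    norm_u = math.sqrt(sum(n * n for n in u.values()))
    norm_v = math.sqrt(sum(n * n for n in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus.
sentences = [
    "the cat sleeps".split(),
    "the dog sleeps".split(),
    "a cat eats".split(),
    "a dog eats".split(),
]
vecs = context_vectors(sentences)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_the = cosine(vecs["cat"], vecs["the"])
```

In this toy data "cat" and "dog" occur in identical contexts and would fall into the same cluster, while "cat" and "the" share no contexts at all.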
