Search results for: word clustering
Number of results: 205729
This study accounts for Korean /n/-epenthesis from a usage-based perspective, by describing the reduced productivity of epenthesis as an analogical change in progress. We found that epenthesis probability rises as whole-word frequency increases, supporting the hypothesis that analogical change begins in low-frequency words (Bybee 2002). We interpret the findings as support for the idea that freq...
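One simple way to probe the reported frequency effect is a logistic regression of epenthesis outcomes on log whole-word frequency; the sketch below uses invented token data and is not the paper's statistical model.

```python
# A minimal sketch (not from the paper) of testing whether epenthesis
# probability rises with whole-word frequency, via logistic regression.
# The toy data are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data: log whole-word frequency and whether /n/-epenthesis was observed
log_freq = np.array([[0.5], [1.2], [2.0], [2.8], [3.5], [4.1], [4.9], [5.6]])
epenthesis = np.array([0, 0, 0, 1, 0, 1, 1, 1])  # 1 = epenthesized token

model = LogisticRegression().fit(log_freq, epenthesis)
# A positive coefficient is consistent with the reported frequency effect.
print("coefficient on log frequency:", model.coef_[0][0])
print("P(epenthesis) at log freq 1 vs 5:",
      model.predict_proba([[1.0], [5.0]])[:, 1])
```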
In this paper, the use of clustering algorithms for decision-level data fusion is proposed. The results of automatic isolated word recognition, derived from speech spectrograph and Linear Predictive Coding (LPC) analysis, are combined using fuzzy clustering algorithms, especially fuzzy k-means and fuzzy vector quantization. Experimental results show that the ...
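Below is a minimal sketch of the fuzzy k-means piece of such a decision-level fusion: per-word score vectors from two hypothetical front ends are concatenated and soft-clustered, and the dominant membership gives the fused decision. The fusion layout, data, and parameters are assumptions, not the paper's setup.

```python
# Fuzzy k-means (fuzzy c-means) sketch for decision-level fusion of two
# recognizers (e.g., spectrograph- and LPC-based). Data are hypothetical.
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, n_iter=100, seed=0):
    """Return (centers, memberships) for data X with fuzzifier m."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))      # soft memberships
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-9
        U = 1.0 / (d ** (2 / (m - 1)))              # standard FCM update
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

# hypothetical per-word score vectors from two front ends, concatenated
rng = np.random.default_rng(1)
scores_spectro = rng.random((20, 5))   # e.g., 5 candidate-word scores
scores_lpc = rng.random((20, 5))
fused = np.hstack([scores_spectro, scores_lpc])

centers, memberships = fuzzy_kmeans(fused, k=5)
decisions = memberships.argmax(axis=1)   # hard decision after soft fusion
print(decisions)
```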
We applied different clustering algorithms to the task of clustering multi-word terms in order to reflect a human-built ontology. Clustering was done without the usual document co-occurrence information. Our clustering algorithm, CPCL (Classification by Preferential Clustered Link), is based on general lexico-syntactic relations which do not require prior domain knowledge or the existence of a...
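As a rough illustration of clustering multi-word terms from lexico-syntactic structure rather than document co-occurrence, the sketch below groups invented terms by the words they share, using agglomerative clustering; it is not the CPCL algorithm itself.

```python
# Cluster multi-word terms by shared heads/modifiers (word overlap) instead
# of document co-occurrence. Terms and threshold are made-up examples.
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

terms = ["blood cell", "red blood cell", "white blood cell",
         "cell membrane", "plasma membrane", "membrane protein"]

# crude lexico-syntactic features: which words a term contains
vocab = sorted({w for t in terms for w in t.split()})
vectors = [[1.0 if w in t.split() else 0.0 for w in vocab] for t in terms]

# cluster terms by word overlap (Jaccard distance), average linkage
Z = linkage(pdist(vectors, metric="jaccard"), method="average")
labels = fcluster(Z, t=0.7, criterion="distance")
for term, lab in zip(terms, labels):
    print(lab, term)
```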
Context-dependent units are broadly used in Continuous Speech Recognition (CSR) systems, with decision trees being a suitable clustering technique for obtaining this kind of unit. This work aimed to extend decision-tree-based clustering to model inter-word context dependencies in Spanish CSR tasks. We first used a set of previously defined context-dependent units to model word boundaries. A deci...
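The sketch below shows the core step of decision-tree-based clustering of context-dependent units: candidate yes/no phonetic questions are scored by the entropy reduction they give over the contexts' labels, and the best question defines a split. The questions, labels, and data are toy assumptions, not the paper's Spanish CSR configuration.

```python
# Greedy question selection for decision-tree clustering of contexts:
# pick the yes/no phonetic question with the largest entropy reduction.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

# (left-context phone, unit label) pairs; labels stand for acoustic classes
data = [("b", "A"), ("p", "A"), ("d", "A"), ("a", "B"),
        ("e", "B"), ("o", "B"), ("m", "C"), ("n", "C")]

questions = {
    "left is a vowel?": {"a", "e", "i", "o", "u"},
    "left is a nasal?": {"m", "n"},
    "left is a stop?": {"b", "p", "d", "t", "g", "k"},
}

base = entropy([label for _, label in data])
for name, phone_set in questions.items():
    yes = [label for ctx, label in data if ctx in phone_set]
    no = [label for ctx, label in data if ctx not in phone_set]
    gain = base - (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(data)
    print(f"{name:20s} gain = {gain:.3f}")
# The best question defines the first split; recursing on each side yields
# the tree whose leaves are the tied context-dependent units.
```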
In this paper we introduce the vector space representation of the N-gram language model, where vectors of K dimensions are given to both words and contexts, i.e., (N-1)-word sequences, so that the scalar product of a 'word vector' and a 'context vector' gives the corresponding N-gram probability. The vector space representation is obtained from singular value decomposition (SVD) of the co-occurr...
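The snippet below is a toy illustration of the factorization this abstract describes: take a (here bigram) conditional probability matrix, apply a truncated SVD, and read off K-dimensional context and word vectors whose dot products approximate the original probabilities. The counts are invented, and the exact matrix and normalization used in the paper are an assumption.

```python
# SVD-based vector-space view of an N-gram model (bigram toy example):
# the dot product of a context vector and a word vector approximates
# P(word | context).
import numpy as np

words = ["the", "cat", "dog", "sat"]
# bigram counts C[i, j] = count(words[i] followed by words[j]) -- made up
C = np.array([[0, 5, 4, 0],
              [1, 0, 0, 6],
              [2, 0, 0, 5],
              [3, 0, 0, 0]], dtype=float)
P = C / C.sum(axis=1, keepdims=True)      # rows: P(next word | context word)

K = 2                                      # reduced dimensionality
U, s, Vt = np.linalg.svd(P, full_matrices=False)
context_vecs = U[:, :K] * s[:K]            # K-dimensional context vectors
word_vecs = Vt[:K, :].T                    # K-dimensional word vectors

approx = context_vecs @ word_vecs.T        # dot products approximate P
print(np.round(approx, 2))
print(np.round(P, 2))
```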
High dimensionality of text can be a deterrent in applying complex learners such as Support Vector Machines to the task of text classification. Feature clustering is a powerful alternative to feature selection for reducing the dimensionality of text data. In this paper we propose a new information-theoretic divisive algorithm for feature/word clustering and apply it to text classification. Exist...
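A simplified, hedged version of divisive information-theoretic word clustering is sketched below: each word is represented by its class-conditional distribution, and words are regrouped k-means-style using KL divergence to prior-weighted cluster distributions. The exact objective and update order of the paper's algorithm may differ.

```python
# Divisive information-theoretic word clustering (simplified sketch):
# cluster words by the KL divergence between P(c|w) and cluster distributions.
import numpy as np

def kl(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)))

def divisive_word_clustering(P_c_given_w, priors, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    n_words = P_c_given_w.shape[0]
    assign = rng.integers(0, k, size=n_words)            # random initial split
    for _ in range(n_iter):
        centroids = []
        for j in range(k):
            idx = np.where(assign == j)[0]
            if len(idx) == 0:                             # keep clusters non-empty
                idx = rng.integers(0, n_words, size=1)
            w = priors[idx] / priors[idx].sum()
            centroids.append(w @ P_c_given_w[idx])        # prior-weighted mean
        assign = np.array([
            np.argmin([kl(P_c_given_w[i], c) for c in centroids])
            for i in range(n_words)
        ])
    return assign

# toy data: 6 words, 3 classes; rows are P(class | word)
P = np.array([[.8, .1, .1], [.7, .2, .1], [.1, .8, .1],
              [.2, .7, .1], [.1, .1, .8], [.2, .2, .6]])
priors = np.ones(6) / 6
print(divisive_word_clustering(P, priors, k=3))
```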
This paper presents an unsupervised method for choosing the correct translation of a word in context. It learns disambiguation information from non-parallel bilingual corpora (preferably in the same domain), free from tagging. Our method combines two existing unsupervised disambiguation algorithms: a word sense disambiguation algorithm based on distributional clustering and a translation disambigu...
Syntactically annotated data such as treebanks are used for training statistical parsers. One of the main aspects in developing statistical parsers is their sensitivity to the training data. Since data sparsity is the biggest challenge in data-oriented analyses, parsers perform poorly if they are trained with a small set of data, or when the genre of the training and the test data are ...
This paper systematically compares unsupervised word sense discrimination techniques that cluster instances of a target word that occur in raw text using both vector and similarity spaces. The context of each instance is represented as a vector in a high dimensional feature space. Discrimination is achieved by clustering these context vectors directly in vector space and also by finding pairwis...
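The following sketch contrasts, on toy data, the two routes this abstract mentions: clustering context vectors directly in vector space with k-means, and clustering in similarity space by agglomerating a pairwise cosine-distance matrix. Feature choice and cluster count are assumptions.

```python
# Unsupervised word sense discrimination: vector space vs. similarity space.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# toy contexts of the ambiguous target word "bank"
contexts = [
    "deposit money at the bank branch",
    "the bank approved the loan application",
    "fishing on the grassy river bank",
    "the bank of the river flooded in spring",
]
X = CountVectorizer().fit_transform(contexts).toarray()

# route 1: cluster the context vectors directly in vector space
vs_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# route 2: cluster in similarity space via pairwise cosine distances
D = cosine_distances(X)
Z = linkage(squareform(D, checks=False), method="average")
ss_labels = fcluster(Z, t=2, criterion="maxclust")

print("vector space:    ", vs_labels)
print("similarity space:", ss_labels)
```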
We use a clustering signature, based on a recently introduced generalization of the clustering coefficient to directed networks, to analyze 16 directed real-world networks of five different types: social networks, genetic transcription networks, word adjacency networks, food webs, and electric circuits. We show that these five classes of networks are cleanly separated in the space of clustering...
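As one plausible reading of such a clustering signature (an assumption, since the abstract does not spell out the definition), the sketch below computes the four directed triangle patterns of Fagiolo's generalization of the clustering coefficient (cycle, middleman, in, out) and averages them into a per-network signature vector.

```python
# Directed clustering signature sketch: per-node cycle, middleman, in, and
# out clustering coefficients, averaged over the network.
import numpy as np

def directed_clustering_signature(A):
    """A: binary adjacency matrix, no self-loops. Returns 4 network averages."""
    A = np.asarray(A, dtype=float)
    d_in, d_out = A.sum(axis=0), A.sum(axis=1)
    d_bi = (A * A.T).sum(axis=1)                      # reciprocated links
    eps = 1e-12                                       # avoid division by zero
    cyc = np.diag(A @ A @ A) / (d_in * d_out - d_bi + eps)
    mid = np.diag(A @ A.T @ A) / (d_in * d_out - d_bi + eps)
    cin = np.diag(A.T @ A @ A) / (d_in * (d_in - 1) + eps)
    cout = np.diag(A @ A @ A.T) / (d_out * (d_out - 1) + eps)
    return np.array([cyc.mean(), mid.mean(), cin.mean(), cout.mean()])

# toy directed network: a 3-cycle plus one extra node feeding into it
A = np.array([[0, 1, 0, 1],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]])
print(directed_clustering_signature(A))
```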