word clustering

A graph-based Gaussian component clustering approach to unsupervised acoustic modeling

2014

Haipeng Wang Tan Lee Cheung-Chi Leung Bin Ma Haizhou Li

This paper describes a new approach to unsupervised acoustic modeling, that is, building acoustic models for phoneme-like sub-word units from untranscribed speech data. The proposed approach is based on Gaussian component clustering. Initially, a large set of Gaussian components is estimated from the untranscribed data. Then clustering is performed to group these Gaussian components into differe...

Word Clustering for Data Sparsity: A Literature Survey

2013

Kashyap Popat

In this report, we present the literature survey done for our work with SA and other NLP applications. The road map of this report is as follows. In Section-1, we introduce clustering process and describe a few existing word clustering techniques. Section-2 talks about the smoothing process followed by why clustering is better for our work in Section-3. Finally in Section-4, we talk about the r...

SOM-based Document Image Retrieval

2005

Stefano Faini Simone Marinai Emanuele Marino Giovanni Soda

In this paper we discuss some applications of word image clustering (based on Self Organizing Maps, SOM) for tasks related to document image retrieval. Two main applications are discussed: document retrieval and word retrieval. In document retrieval a document representation based on the vector model is obtained by computing the occurrences of words belonging to the SOM clusters in each documen...
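The vector-model representation mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each word image has already been assigned a SOM cluster index, and the names `doc_vector`, `cosine`, `docs`, and `query` are hypothetical.

```python
import numpy as np

def doc_vector(cluster_ids, n_clusters):
    """Count how many word images of a document fall into each cluster."""
    return np.bincount(cluster_ids, minlength=n_clusters).astype(float)

def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical data: cluster index of every word image in each document.
N_CLUSTERS = 4
docs = {
    "a": [0, 0, 0, 2],
    "b": [0, 2, 2],
    "c": [1, 1, 3],
}
query = [0, 2]  # cluster indices of the query's word images

q_vec = doc_vector(query, N_CLUSTERS)
scores = {name: cosine(q_vec, doc_vector(ids, N_CLUSTERS)) for name, ids in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Documents that share no clusters with the query score zero and fall to the end of `ranked`.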

Iterative Constrained Clustering for Subjectivity Word Sense Disambiguation

2014

Cem Akkaya Janyce Wiebe Rada Mihalcea

Subjectivity word sense disambiguation (SWSD) is a supervised and application-specific word sense disambiguation task disambiguating between subjective and objective senses of a word. Not surprisingly, SWSD suffers from the knowledge acquisition bottleneck. In this work, we use a “cluster and label” strategy to generate labeled data for SWSD semi-automatically. We define a new algorithm called It...

Unsupervised Parts-of-Speech Induction for Bengali

2008

Joy Deep Nath Monojit Choudhury Animesh Mukherjee Christian Biemann Niloy Ganguly

We present a study of the word interaction networks of Bengali in the framework of complex networks. The topological properties of these networks reveal interesting insights into the morpho-syntax of the language, whereas clustering helps in the induction of the natural word classes leading to a principled way of designing POS tagsets. We compare different network construction techniques and cl...

Enhanced word classing for model M

2010

Stanley F. Chen Stephen M. Chu

Model M is a superior class-based n-gram model that has shown improvements on a variety of tasks and domains. In previous work with Model M, bigram mutual information clustering has been used to derive word classes. In this paper, we introduce a new word classing method designed to closely match with Model M. The proposed classing technique achieves gains in speech recognition word-error rate o...

A partitioning based algorithm to fuzzy co-cluster documents and words

Journal: Pattern Recognition Letters 2006

William-Chandra Tjhi Lihui Chen

In this paper, a new algorithm, fuzzy co-clustering with Ruspini's condition (FCR), is proposed for co-clustering documents and words. Compared to most existing fuzzy co-clustering algorithms, FCR is able to generate fuzzy word clusters that capture the natural distribution of words, which may be beneficial for information retrieval. We discuss the principle behind the algorithm through some theo...

Efficient Word Retrieval by Means of SOM Clustering and PCA

2006

Simone Marinai Stefano Faini Emanuele Marino Giovanni Soda

We propose an approach for efficient word retrieval from printed documents belonging to Digital Libraries. The approach combines word image clustering (based on Self Organizing Maps, SOM) with Principal Component Analysis. The combination of these methods allows us to efficiently retrieve the matching words from large document collections without the need for a direct comparison of the query w...

RNN language model with word clustering and class-based output layer

Journal: EURASIP J. Audio, Speech and Music Processing 2013

Yongzhe Shi Weiqiang Zhang Jia Liu Michael T. Johnson

The recurrent neural network language model (RNNLM) has shown significant promise for statistical language modeling. In this work, a new class-based output layer method is introduced to further improve the RNNLM. In this method, word class information is incorporated into the output layer by utilizing the Brown clustering algorithm to estimate a class-based language model. Experimental results ...
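A class-based output layer is commonly understood as the factorization P(w | h) = P(c(w) | h) · P(w | c(w), h), where c(w) is the class of word w. The sketch below illustrates that general idea only and may differ from the paper's exact formulation. The class assignment is hard-coded here rather than produced by Brown clustering, and all names and weights (`class_factored_prob`, `W_class`, `W_word`, `word2class`, `class_words`) are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def class_factored_prob(hidden, word, word2class, class_words, W_class, W_word):
    """Return P(word | hidden) = P(class | hidden) * P(word | class, hidden)."""
    c = word2class[word]
    p_class = softmax(W_class @ hidden)             # distribution over classes
    members = class_words[c]                        # word ids belonging to class c
    p_in_class = softmax(W_word[members] @ hidden)  # normalized over class members only
    return p_class[c] * p_in_class[members.index(word)]

# Hypothetical toy setup: 6 words in 3 classes, hidden state of size 5.
rng = np.random.default_rng(0)
word2class = [0, 0, 1, 1, 1, 2]
class_words = {0: [0, 1], 1: [2, 3, 4], 2: [5]}
hidden = rng.normal(size=5)
W_class = rng.normal(size=(3, 5))
W_word = rng.normal(size=(6, 5))

probs = [
    class_factored_prob(hidden, w, word2class, class_words, W_class, W_word)
    for w in range(6)
]
total = sum(probs)
```

Because the inner softmax covers only the members of one class, evaluating a single word touches the class scores plus that class's members instead of the full vocabulary, and the probabilities still sum to one over all words.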

Developing the Persian version of the homophone meaning generation test

Journal: Medical Journal of Islamic Republic of Iran 2016

Hassan Ashayeri, Mohammad Kamali, Mohammad Reza Motamed, Mona Ebrahimipour, Yahya Modarresi

Background: Finding the right word is a necessity in communication, and its evaluation has always been a challenging clinical issue, suggesting the need for valid and reliable measurements. The Homophone Meaning Generation Test (HMGT) can measure the ability to switch between verbal concepts, which is required in word retrieval. The purpose of this study was to adapt and validate the Persian ve...
