lexical clusters

Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters

2013

Olutobi Owoputi Brendan T. O'Connor Chris Dyer Kevin Gimpel Nathan Schneider Noah A. Smith

We consider the problem of part-of-speech tagging for informal, online conversational text. We systematically evaluate the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy. With these features, our system achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks; Twitter tagging is improved from 90% to 93% accuracy (m...

Fast algorithm for speech recognition using speaker cluster HMM

1997

Masayuki Yamada Yasuhiro Komori Tetsuo Kosaka Hiroki Yamamoto

This paper describes a high-speed algorithm for a speech recognizer based on the speaker cluster HMM. The speaker cluster HMM, which makes it possible to handle variation among speakers, has been reported to perform well. However, the amount of computation grows in proportion to the number of clusters when the speaker cluster HMM is used in speaker-independent recognition, where the recognition proc...

CST Bank: A Corpus for the Study of Cross-document Structural Relationships

2004

Dragomir R. Radev Jahna Otterbacher Zhu Zhang

Clusters of multiple news stories related to the same topic exhibit a number of interesting properties. For example, when documents have been published at various points in time or by different authors or news agencies, one finds many instances of paraphrasing, information overlap and even contradiction. The current paper presents the Cross-document Structure Theory (CST) Bank, a collection of ...

Semantic, Lexical, and Geographic Cues are used in Geographic Fluency

2016

Janelle Szary Michael N. Jones

Semantic fluency tasks have increasingly been used to probe the structure of human memory, adopting methodologies from the ecological foraging literature to describe memory as a trajectory through semantic space. Clusters of semantically related items are often produced together, and the transitions between these clusters of semantically related items are consistent with theories of optimal for...

A Language Independent Author Verifier Using Fuzzy C-Means Clustering

2014

Pashutan Modaresi Philipp Gross

In this work we describe our approach to solve the author verification problem introduced in the PAN 2014 Author Identification task. The author verification task presents participants with a set of problems where each problem consists of a set of documents written by the same author and a questioned document with an unknown author. The task is then to decide whether the questioned document has...

Simple Semi-Supervised POS Tagging

2015

Karl Stratos Michael Collins

We tackle the question: how much supervision is needed to achieve state-of-the-art performance in part-of-speech (POS) tagging, if we leverage lexical representations given by the model of Brown et al. (1992)? It has become a standard practice to use automatically induced “Brown clusters” in place of POS tags. We claim that the underlying sequence model for these clusters is particularly well-s...

Typological implications of Kalam predictable vowels*

2012

Juliette Blevins Andrew Pawley

Kalam is a Trans New Guinea language of Papua New Guinea. Kalam has two distinct vowel types: full vowels /a e o/, which are of relatively long duration and stressed, and reduced central vowels, which are shorter and often unstressed, and occur predictably within word-internal consonant clusters and in monoconsonantal utterances. The predictable nature of the reduced vowels has led earlier rese...

Distributed Distributional Similarities of Google Books over the Centuries

2014

Martin Riedl Richard Steuer Christian Biemann

This paper introduces a distributional thesaurus and sense clusters computed on the complete Google Syntactic N-grams, which is extracted from Google Books, a very large corpus of digitized books published between 1520 and 2008. We show that a thesaurus computed on such a large text basis leads to much better results than using smaller corpora like Wikipedia. We also provide distributional thes...

To Coerce or Not to Coerce: A Corpus-based Exploration of Some Complement Coercion Verbs in Chinese

2013

Chan-Chia Hsu

This study takes a corpus-based approach to examine twenty Chinese verbs that have been found to coerce their NP complements into an event type (cf. Lin et al. 2009), with an aim of creating a coercion profile for each verb. A cluster analysis is further conducted on the coercion profiles. The resulting clusters in our analysis show a bi-directional distribution: the verbs in Cluster 1 are foun...

Cross-lingual Linking of Multi-word Entities and their corresponding Acronyms

2016

Guillaume Jacquet Maud Ehrmann Ralf Steinberger Jaakko Väyrynen

This paper reports on an approach and experiments to automatically build a cross-lingual multi-word entity resource. Starting from a collection of millions of acronym/expansion pairs for 22 languages where expansion variants were grouped into monolingual clusters, we experiment with several aggregation strategies to link these clusters across languages. Aggregation strategies make use of string...
