post text

Statistical correlation analysis in image retrieval

Journal: Pattern Recognition, 2002

Mingjing Li Zheng Chen HongJiang Zhang

A statistical correlation model for image retrieval is proposed. This model captures the semantic relationships among images in a database from simple statistics of user-provided relevance feedback information. It is applied in the post-processing of image retrieval results such that more semantically related images are returned to the user. The algorithm is easy to implement and can be efficien...


External Query Expansion in the Blogosphere

2008

Wouter Weerkamp Maarten de Rijke

We describe the participation of the University of Amsterdam’s ILPS group in the blog track at TREC 2008. We mainly explored different ways of using external corpora to expand the original query. In the blog post retrieval task we did not succeed in improving over a simple baseline (equal weights for both the expanded and original query). Obtaining optimal weights for the original and the expan...


Multilingual number transcription for text-to-speech conversion

2013

Rubén San-Segundo-Hernández Juan Manuel Montero-Martínez Mircea Giurgiu Ioana Muresan Simon King

This paper describes the text normalization module of a fully trainable text-to-speech conversion system and its application to number transcription. The main target is to generate a language-independent text normalization module, based on data instead of on expert rules. This paper proposes a general architecture based on statistical machine translation techniques. This proposal is composed of...


Practical Web Crawling for Text Corpora

2011

Vit Suchomel Jan Pomikálek

SpiderLing—a web spider for linguistics—is new software for creating text corpora from the web, which we present in this article. Many documents on the web only contain material which is not useful for text corpora, such as lists of links, lists of products, and other kinds of text not comprised of full sentences. In fact such pages represent the vast majority of the web. Therefore, by doing unr...


Multi-level post-processing for Korean character recognition using morphological analysis and linguistic evaluation

Journal: Pattern Recognition, 1997

Gary Geunbae Lee Jong-Hyeok Lee JinHee Yoo

Most of the post-processing methods for character recognition rely on contextual information at the character and word-fragment levels. However, due to linguistic characteristics of Korean, such low-level information alone is not sufficient for high-quality character-recognition applications, and we need much higher-level contextual information to improve the recognition results. This paper present...


LIMSI@WMT'15: Translation Task

2015

Benjamin Marie Alexandre Allauzen Franck Burlot Quoc-Khanh Do Julia Ive Elena Knyazeva Matthieu Labeau Thomas Lavergne Kevin Löser Nicolas Pécheux François Yvon

This paper describes LIMSI’s submissions to the shared WMT’15 translation task. We report results for French-English, Russian-English in both directions, as well as for Finnish-into-English. Our submissions use NCODE and MOSES along with continuous space translation models in a post-processing step. The main novelties of this year’s participation are the following: for Russian-English, we inves...


Learning Morphology of Romance, Germanic and Slavic Languages with the Tool Linguistica

2010

Helena Blancafort

In this paper we present preliminary work conducted on semi-automatic induction of inflectional paradigms from non-annotated corpora using the open-source tool Linguistica (Goldsmith 2001) that can be utilized without any prior knowledge of the language. The aim is to induce morphology information from corpora so as to compare languages and foresee the difficulty to develop morphosyntactic le...


Evaluating Memory Efficiency and Robustness of Word Embeddings

2016

Johannes Jurgovsky Michael Granitzer Christin Seifert

Skip-Gram word embeddings, estimated from large text corpora, have been shown to improve many NLP tasks through their high-quality features. However, little is known about their robustness against parameter perturbations and about their efficiency in preserving word similarities under memory constraints. In this paper, we investigate three post-processing methods for word embeddings to study their...


TweetCaT: a tool for building Twitter corpora of smaller languages

2014

Nikola Ljubesic Darja Fiser Tomaz Erjavec

This paper presents TweetCaT, an open-source Python tool for building Twitter corpora that was designed for smaller languages. Using the Twitter search API and a set of seed terms, the tool identifies users tweeting in the language of interest together with their friends and followers. By running the tool for 235 days we tested it on the task of collecting two monitor corpora, one for Croatian ...
