historical simulation

A multi-scale framework for adaptive binarization of degraded document images

Journal: Pattern Recognition, 2010

Reza Farrahi Moghaddam Mohamed Cheriet

In this work, a multi-scale binarization framework is introduced, which can be used along with any adaptive threshold-based binarization method. This framework is able to improve the binarization results and to restore weak connections and strokes, especially in the case of degraded historical documents. This is achieved thanks to the localized nature of the framework in the spatial domain. The fra...
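As a point of reference for the "adaptive threshold-based binarization method" that the abstract mentions, here is a minimal sketch of one such method: a plain local-mean threshold. It is not the multi-scale framework of the paper, whose details are cut off above. The function name `adaptive_binarize` and the parameters `window` and `offset` are hypothetical, chosen only for illustration.

```python
def adaptive_binarize(image, window=3, offset=0.0):
    """Mark a pixel as ink (1) when it is darker than its local mean minus offset.

    `image` is a list of rows of grayscale values (0 = black, 255 = white).
    This is a generic local-mean threshold, assumed for illustration only.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            rows = range(max(0, y - half), min(height, y + half + 1))
            cols = range(max(0, x - half), min(width, x + half + 1))
            values = [image[r][c] for r in rows for c in cols]
            local_mean = sum(values) / len(values)
            # The threshold adapts to each neighbourhood instead of being global.
            result[y][x] = 1 if image[y][x] < local_mean - offset else 0
    return result


page = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
binary = adaptive_binarize(page)
```

Because the threshold is computed per neighbourhood, a faint stroke on a dark, stained background can still be separated, which a single global threshold would tend to miss.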


Enriching Digitized Medieval Manuscripts: Linking Image, Text and Lexical Knowledge

2015

Aitor Arronte Alvarez

This paper describes an on-going project of transcribing and annotating digitized manuscripts of medieval Spanish with paleographic and lexical information. We link lexical units from the manuscripts with the Multilingual Central Repository (MCR), making terms retrievable by any of the languages that integrate MCR. The goal of the project is twofold: creating a paleographic knowledge base from ...


Textline detection in degraded historical document images

Journal: EURASIP J. Image and Video Processing, 2017

Byeongyong Ahn Jewoong Ryu Hyung Il Koo Nam Ik Cho

This paper presents a textline detection method for degraded historical documents. Our method follows a conventional two-step procedure in which binarization is performed first and the textlines are then extracted from the binary image. In order to address the challenges in historical documents such as document degradation, structure noise, and skews, we develop new methods for the binarization...
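The conventional two-step procedure named in the abstract (binarize, then extract textlines from the binary image) can be sketched as follows. This is a deliberately simple baseline, assuming a fixed global threshold and a horizontal projection profile; it is not the authors' method, and it does not handle the skew or structure noise that the paper sets out to address. The names `binarize`, `find_textlines`, `threshold`, and `min_ink` are hypothetical.

```python
def binarize(image, threshold=128):
    """Step 1: map grayscale pixels to ink (1) or background (0)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


def find_textlines(binary, min_ink=1):
    """Step 2: return (first_row, last_row) pairs for runs of rows containing ink."""
    lines = []
    start = None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink  # horizontal projection of this row
        if has_ink and start is None:
            start = y  # a new textline begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))  # the textline ended on the previous row
            start = None
    if start is not None:
        # The last textline runs to the bottom edge of the image.
        lines.append((start, len(binary) - 1))
    return lines


page = [
    [255, 255, 255],
    [0, 255, 0],
    [0, 0, 255],
    [255, 255, 255],
    [255, 0, 255],
]
textlines = find_textlines(binarize(page))
```

Raising `min_ink` makes the profile ignore rows with only a few stray dark pixels, at the cost of possibly splitting lines with thin strokes.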


A line-based representation for matching words in historical manuscripts

Journal: Pattern Recognition Letters, 2011

Ethem Fatih Can Pinar Duygulu Sahin

doi:10.1016/j.patrec.2011.02.013

In this study, we propose a new method for retrieving and recognizing words in historical documents. We represent word images with a set of line segments. Then we provide a criterion for word matchin...


A classification-free word-spotting system

2013

Nikos Vasilopoulos Ergina Kavallieratou

In this paper, a classification-free word-spotting system, appropriate for the retrieval of printed historical document images, is proposed. The system skips many of the procedures of a common approach. It does not include segmentation, feature extraction, or classification. Instead, it treats the queries as compact shapes and uses image processing techniques in order to localize a query in the do...


Building a historical corpus for Classical Portuguese: some technological aspects

2006

Maria Clara Paixão de Sousa Thorsten Trippel

This paper describes the restructuring process of a large corpus of historical documents and the system architecture that is used for accessing it. The initial challenge of this process was to get the most out of existing material, normalizing the legacy markup and harvesting the inherent information using widely available standards. This resulted in a conceptual and technical restructuring of ...


Input sensitive thresholding for ancient Hebrew manuscript

Journal: Pattern Recognition Letters, 2005

Itay Bar Yosef

In this paper, we describe an input sensitive thresholding algorithm for ancient Hebrew calligraphy documents. Usually, historical document images are of poor quality since the documents have degraded over time due to storage conditions. However, the distribution of noise within a document is not uniform, and the quality of the characters may vary. We develop tools to identify noisy characters and apply ...


An Unsupervised Model of Orthographic Variation for Historical Document Transcription

2016

Dan Garrette Hannah Alpert-Abrams

Historical documents frequently exhibit extensive orthographic variation, including archaic spellings and obsolete shorthand. OCR tools typically seek to produce so-called diplomatic transcriptions that preserve these variants, but many end tasks require transcriptions with normalized orthography. In this paper, we present a novel joint transcription model that learns, unsupervised, a probabili...


Parsing the Past - Identification of Verb Constructions in Historical Text

2012

Eva Pettersson Beáta Megyesi Joakim Nivre

Even though NLP tools are widely used for contemporary text today, there is a lack of tools that can handle historical documents. Such tools could greatly facilitate the work of researchers dealing with large volumes of historical texts. In this paper we propose a method for extracting verbs and their complements from historical Swedish text, using NLP tools and dictionaries developed for conte...


The Gamera framework for building custom recognition systems

2003

Michael Droettboom Karl MacMillan Ichiro Fujinaga

This paper describes the Gamera framework for building custom document recognition systems. This open-source system is designed to support the test-and-refine development cycle: an important style for developing recognition systems that work with difficult historical documents, since the solutions are often non-obvious. This paper explains the overall architecture of the system, in addition to d...
