Improvements in Handwritten and Printed Text Separation in Historical Archival Documents

Authors

Abstract

The presence of handwritten text and annotations combined with typewritten and machine-printed text in historical archival records makes them visually complex, posing challenges for OCR systems in accurately transcribing their content. This paper is an extension of [1], reporting on improvements in the separation of handwritten text from machine-printed text (including typewriters), by the use of FCN-based models trained on datasets created by different data synthesis pipelines. Results show a significant increase of about 20% in the intrinsic evaluation on artificial test sets, and an 8% improvement in the extrinsic evaluation on a subsequent task on real documents.

Similar resources

Discrimination between Printed and Handwritten Text in Documents

Recognition techniques for printed and handwritten text in scanned documents are significantly different. In this paper, we propose a method to automatically identify the signature in scanned document images. This helps to retrieve the document images based on the signature. A simple region growing algorithm is used to segment the document into a number of patches. A patch is composed of many...

Text-image alignment for historical handwritten documents

We describe our work on text-image alignment in the context of building a historical document retrieval system. We aim at aligning images of words in handwritten lines with their text transcriptions. The images of handwritten lines are automatically segmented from the scanned pages of historical documents and then manually transcribed. To train automatic routines to detect words in an image of hand...

Handwritten Text Recognition for Historical Documents

The number of digitized legacy documents has been rising dramatically over recent years, mainly due to the increasing number of on-line digital libraries publishing this kind of document. The vast majority of them remain waiting to be transcribed into a textual electronic format (such as ASCII or PDF) that would provide historians and other researchers new ways of indexing, consulting and que...

Handwritten and Printed Text Separation in Real Document

The aim of the paper is to separate handwritten and printed text in a real document containing noise and graphics, including annotations. Relying on the run-length smoothing algorithm (RLSA), the extracted pseudo-lines and pseudo-words are used as basic blocks for classification. To handle this, a multi-class support vector machine (SVM) with Gaussian kernel performs a first labelling of each pseud...

Text Extraction from Historical Handwritten Documents by Edge Detection

Many national archives and libraries keep large amounts of historical handwritten documents. One problem that many archivists face is the seeping of ink through the pages of certain double-sided handwritten documents after long periods of storage. The result is that the handwritten characters from the reverse side appear as noise on the front side and even interfere with the front side char...

Journal

Journal title: Archiving

Year: 2023

ISSN: 2161-8798, 2168-3204

DOI: https://doi.org/10.2352/issn.2168-3204.2023.20.1.7