document image

Extraction of Original Text Document from a Set of Degraded Text Documents from the Same Source

2016

Navya Prakash

Information extraction is the task of extracting structured data from a degraded document. It includes extracting text, images, or graphics from sources such as images, video, or documents. Text detection and extraction from degraded documents finds application in a wide range of studies. In this paper, an Optical Character Recognition-less (OCR-less) method of obtaining an origi...


Content-Based Similar Document Image Retrieval Using Fusion of CNN Features

2017

Mao Tan Siping Yuan Yongxin Su

The rapid increase in digitized documents has created high demand for document image retrieval. While conventional document image retrieval approaches depend on complex OCR-based text recognition and text similarity detection, this paper proposes a new content-based approach in which more attention is paid to feature extraction and fusion. In the proposed approach, multiple features of document i...


OCR accuracy improvement on document images through a novel pre-processing approach

Journal: CoRR 2015

Abdeslam El Harraj Naoufal Raissouni

Digital camera and mobile document image acquisition are new trends arising in the world of Optical Character Recognition and text detection. In some cases, such a process introduces many distortions and produces poorly scanned text or text-photo images and natural images, leading to unreliable OCR digitization. In this paper, we present a novel nonparametric and unsupervised method to compens...


Automatic generation of character groundtruth for scanned documents: a closed-loop approach

1996

Tapas Kanungo Robert M. Haralick

Character groundtruth for scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not possible because (i) accuracy in delineating groundtruth character bounding boxes is not high enough, (ii)...


Knowledge-based derivation of document logical structure

1995

Debashish Niyogi Sargur N. Srihari

The analysis of a document image to derive a symbolic description of its structure and contents involves using spatial domain knowledge to classify the different printed blocks (e.g., text paragraphs), group them into logical units (e.g., newspaper stories), and determine the reading order of the text blocks within each unit. These steps describe the conversion of the physical structure of a do...


Techniques for Line Drawing Interpretation: An Overview (Invited)

1990

Rangachar Kasturi Senthil Siva Lawrence O'Gorman

An overview is presented of algorithms and techniques for document image analysis with an emphasis on those for graphics recognition and interpretation. The techniques are derived from the fields of image processing, pattern recognition, and machine vision. The objective in document image analysis is to recognize page contents including layout, text, and figures. Although optical character reco...


Degraded Script Identification for Indian Language- A Survey

2014

Manoj Kumar Shukla Haider Banka S. N. Srihari C. Y. Suen R. Legault C. Nadal M. Cheriet

The performance of any Optical Character Recognition (OCR) system depends largely on the print and paper quality of the input document image. A number of OCR techniques are available and claim accurate recognition of printed document images in Indian and foreign scripts. A few reports have been found on the recognition of degraded Indian-language documents. The degradation in any scanned printed ...


An Approach to Word Image Matching Based on Weighted Hausdorff Distance

2001

Yue Lu Chew Lim Tan Weihua Huang Liying Fan

An approach to word image matching based on weighted Hausdorff distance (WHD) is proposed in this paper to facilitate the detection and location of user-specified words in document images. Preprocessing, such as eliminating the space between adjacent characters in the word images and scale normalization, is done first, before the WHD is used to measure the distance between the template ...


Adaptive Image Contrast with Binarization Technique for Degraded Document Image

2014

M. Tamilselvi

Segmentation of text from badly degraded document images is a very challenging task due to the high inter/intra variation between the document background and the foreground text of different document images. In this paper, we propose a novel document image binarization technique that add...


Language Independent Single Document Image Super-Resolution using CNN for improved recognition

Journal: CoRR 2017

Ram Krishna Pandey A. G. Ramakrishnan

Recognition of document images has important applications in restoring old and classical texts. The problem involves improving image quality before passing the image to a properly trained OCR to get accurate recognition of the text. Image enhancement and quality improvement are important steps, as subsequent recognition depends on the quality of the input image. There are scenarios when high ...
