The Postprocessing of Optical Character Recognition Based on Statistical Noisy Channel and Language Model

Authors

  • Jason J. S. Chang
  • Shun-der Chen
Abstract

Image-processing techniques have long been used in optical character recognition (OCR). Recognition methods have evolved from early "pattern recognition" to, more recently, "feature extraction," and the recognition rate has risen from 70% to 90%. However, character-by-character recognition has its limits. Using language models to help OCR systems improve their recognition rate is the topic of much recent research. Research on Chinese natural language processing has also advanced rapidly in recent years. These advances include Chinese word segmentation, syntactic analysis, semantic analysis, collocation analysis, and statistical language models. In this paper, we propose a new technique for Chinese OCR postprocessing and postediting. We combine a noisy channel model with natural language processing techniques to implement an OCR postprocessing system. Our experiments show that the noisy channel model is very effective for postprocessing. Under this approach, it is possible to recover the correct character even when it is not in the candidate list produced by the OCR system.
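The general idea of combining a noisy channel model with a language model can be illustrated with a minimal sketch. It is not the authors' system: the function name `viterbi_correct`, the bigram language model, the probability floor used as smoothing, and the toy Latin-alphabet tables are all assumptions made for illustration. The decoder searches for the character sequence that maximizes P(sequence) × P(OCR output | sequence), and it searches over a vocabulary rather than over the OCR candidate list only.

```python
import math


def viterbi_correct(observed, channel, bigram, vocab, floor=1e-6):
    """Return the most likely true character sequence for an OCR output.

    observed: string produced by the OCR system
    channel[(true, seen)]: P(seen | true), the channel confusion probability
    bigram[(prev, cur)]: P(cur | prev), with "<s>" as the start symbol
    vocab: candidate true characters (not limited to what the OCR emitted)
    floor: probability given to unseen events (a crude smoothing assumption)
    """
    if not observed:
        return ""

    def lp(table, key):
        # Log-probability lookup with a floor for unseen events.
        return math.log(table.get(key, floor))

    # best[c] = (log-probability, path) of the best sequence ending in c.
    best = {
        c: (lp(bigram, ("<s>", c)) + lp(channel, (c, observed[0])), [c])
        for c in vocab
    }
    for seen in observed[1:]:
        step = {}
        for cur in vocab:
            # Best predecessor under the language model.
            score, path = max(
                ((best[p][0] + lp(bigram, (p, cur)), best[p][1]) for p in vocab),
                key=lambda t: t[0],
            )
            # Add the channel score for emitting the observed character.
            step[cur] = (score + lp(channel, (cur, seen)), path + [cur])
        best = step
    return "".join(max(best.values(), key=lambda t: t[0])[1])


# Hypothetical toy tables: the OCR system misreads "h" as "b".
toy_vocab = ["t", "h", "e"]
toy_channel = {("t", "t"): 0.9, ("h", "h"): 0.8, ("h", "b"): 0.2, ("e", "e"): 0.9}
toy_bigram = {("<s>", "t"): 0.5, ("t", "h"): 0.6, ("h", "e"): 0.7}

corrected = viterbi_correct("tbe", toy_channel, toy_bigram, toy_vocab)
```

With these toy tables, `corrected` is `"the"`: the character "h" is recovered although the OCR output never contained it, because the channel table assigns a nonzero probability to "h" being read as "b" and the language model favors "t", "h", "e".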

Similar resources

A Generative Probabilistic OCR Model for NLP Applications

In this paper, we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system. The model is designed for use in error correction, with a focus on post-processing the output of black-box OCR systems in o...


Automatic Arabic Spelling Errors Detection and Correction Based on Confusion Matrix- Noisy Channel Hybrid System

Arabic spelling errors occur in different types of documents, such as documents handwritten by inexperienced users, optical character recognition (OCR) documents, and machine-translated documents. Many researchers have tried to solve this problem, but so far there is no radical solution. This paper proposes a hybrid system based on the confusion matrix and the noisy channel spelling correction model t...


A new model for Persian multi-part words edition based on statistical machine translation

Multi-part words in English are hyphenated, with the hyphen separating the different parts. Persian has multi-part words as well. Based on Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use a space character instead of the half-space character. This common incorrect use of the space leads to some s...


Comparison of Two Contextual Post-Processing Algorithms for Text Recog

The binary n-gram and Viterbi algorithms are alternative approaches to contextual post-processing for text produced by a noisy channel such as an optical character recognizer. The paper describes the underlying theory of each approach in unified terminology, presents a storage-efficient data structure for the binary n-gram algorithm and a recursive formulation for the Viterbi algorithm. Relat...


A Robust Text Processing Technique Applied to Lexical Error Recovery

This thesis addresses automatic lexical error recovery and tokenization of corrupt text input. We propose a technique that can automatically correct mis-spellings, segmentation errors and real-word errors in a unified framework that uses both a model of language production and a model of the typing behavior, and which makes tokenization part of the recovery process. The typing process is modele...




Publication date: 1995