Degraded Text Recognition Using Word Collocation and Visual Inter-Word Constraints

Authors

  • Tao Hong
  • Jonathan J. Hull
Abstract

Given a noisy text page, a word recognizer can generate a set of candidates for each word image. A relaxation algorithm was proposed previously by the authors that uses word collocation statistics to select, for each word, the candidate that has the highest probability of being the correct decision. Because word collocation is a local constraint and collocation data trained from corpora are usually incomplete, the algorithm cannot select the correct candidates for some images. To overcome this limitation, contextual information at the image level is now exploited inside the relaxation algorithm. If two word images match each other, they should have the same symbolic identity. Visual inter-word relations provide a way to link word images in the text and to interpret them systematically. By integrating visual inter-word constraints with word collocation data, the performance of the relaxation algorithm is improved.

Introduction

Word collocation is one source of information that has been proposed as a useful tool to post-process word recognition results ([1, 4]). It can be considered as a constraint on candidate selection, so that the word candidate selection problem can be formalized as an instance of constraint satisfaction. Relaxation is a typical method for constraint satisfaction problems. One of the advantages of relaxation is that it can achieve a global effect by using local constraints.

Previously, a probabilistic relaxation algorithm was proposed for word candidate re-evaluation and selection ([2]). The basic idea of the algorithm is to use word collocation constraints to select the word candidates that have a high probability of occurring simultaneously with word candidates at other nearby locations. The algorithm runs iteratively. In each iteration, the probability of each word candidate is updated based on its previous probability, the probabilities of its neighbors, and word collocation data. The initial probability of each word candidate is provided by a word recognizer. The relaxation process terminates when the probability of each word candidate becomes stable. After relaxation finishes, for each word image, the word candidate with the highest probabilistic score is selected as the decision word.

Because the window size of word collocation is usually small, word collocation is a local constraint. Because word collocation data are derived from text corpora, they are usually incomplete and unbalanced. These properties limit the usefulness of word collocation for candidate selection. By analyzing the performance of the algorithm, three sources of errors were identified: (1) the local context cannot provide enough information to distinguish the competing candidates; (2) word collocation data trained from corpora are not complete, so they do not include the statistical data needed to select the correct candidate; and (3) word collocation data trained from unbalanced corpora are biased, so the wrong candidate is selected.

In a normal English text, there are many occurrences of the same words. Because the main body of a text is usually prepared in the same font type, different occurrences of the same word are visually similar even if the text image is highly degraded. Visual similarity between word images can place useful constraints on the process of candidate selection ([3]). If two word images match each other, their identities should be the same. For example, given the two sentences "Please fill in the application X" and "This Y is almost the same as that one", where X and Y are visually similar and both have the candidate set {farm, form}, the candidate "form" can easily be selected as the decision for X and Y if we consider both word collocation and visual inter-word constraints, although it is difficult to select a candidate for Y by using word collocation alone.
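The interplay described above can be illustrated with a small Python sketch. It is not the authors' algorithm: the excerpt does not give the update formula, so the multiplicative support rule, the averaging of probabilities across visually matching images, and every name and number here (`relax`, `colloc`, `links`, `floor`, the collocation strengths, the recognizer scores) are assumptions chosen only to reproduce the "farm"/"form" example.

```python
from collections import defaultdict

def relax(candidates, colloc, links=(), window=1, max_iter=20, tol=1e-6, floor=0.01):
    """Toy relaxation over word candidates (hypothetical update rule).

    candidates: one dict per word image, mapping candidate word -> initial probability
    colloc:     dict mapping (word, word) -> collocation strength (looked up symmetrically)
    links:      groups (sets) of positions whose word images match visually
    """
    probs = [dict(c) for c in candidates]
    for _ in range(max_iter):
        new_probs = []
        for i, dist in enumerate(probs):
            lo, hi = max(0, i - window), min(len(probs), i + window + 1)
            scored = {}
            for w, p in dist.items():
                # Support from neighbours' candidates; `floor` keeps it non-zero
                # when the collocation table has no entry (incomplete data).
                support = floor
                for j in range(lo, hi):
                    if j == i:
                        continue
                    for v, q in probs[j].items():
                        support += q * max(colloc.get((w, v), 0.0), colloc.get((v, w), 0.0))
                scored[w] = p * support
            total = sum(scored.values())
            new_probs.append({w: s / total for w, s in scored.items()})
        # Visual inter-word constraint (assumed form): matching images share one
        # distribution, here the average of their individual distributions.
        for group in links:
            pooled = defaultdict(float)
            for i in group:
                for w, p in new_probs[i].items():
                    pooled[w] += p / len(group)
            for i in group:
                new_probs[i] = dict(pooled)
        delta = max(abs(new_probs[i][w] - probs[i].get(w, 0.0))
                    for i in range(len(probs)) for w in new_probs[i])
        probs = new_probs
        if delta < tol:  # probabilities have become stable
            break
    return [max(d, key=d.get) for d in probs]

# Made-up data: positions 1 (X) and 3 (Y) are the ambiguous word images.
page = [
    {"application": 1.0},
    {"farm": 0.5, "form": 0.5},    # X
    {"this": 1.0},
    {"farm": 0.55, "form": 0.45},  # Y: the recognizer slightly prefers "farm"
    {"is": 1.0},
]
colloc = {("application", "form"): 0.8, ("application", "farm"): 0.05}

without_links = relax(page, colloc)
with_links = relax(page, colloc, links=[{1, 3}])
print(without_links)
print(with_links)
```

With these invented numbers, collocation alone resolves X to "form" but leaves Y at the recognizer's preference "farm", because no collocation entry covers Y's context. Linking positions 1 and 3 lets the evidence for X carry over to Y, so both resolve to "form".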
Modified Relaxation Algorithm

Figure 1 is the description of the new relaxation algorithm that integrates word collocation and visual inter-word constraints for candidate selection. Given a sequence of word images from a text page, the first step of

Similar resources

Integration of Visual Inter-Word Constraints and Linguistic Knowledge in Degraded Text Recognition

Degraded text recognition is a difficult task. Given a noisy text image, a word recognizer can be applied to generate several candidates for each word image. High-level knowledge sources can then be used to select a decision from the candidate set for each word image. In this paper, we propose that visual inter-word constraints can be used to facilitate candidate selection. Visual inter-word const...

Degraded text recognition using word collocation

A relaxation-based algorithm is proposed that improves the performance of a text recognition technique by propagating the influence of word collocation statistics. Word collocation refers to the likelihood that two words co-occur within a fixed distance of one another. For example, in a story about water transportation, it is highly likely that the word "river" will occur within ten words on eithe...

Character segmentation using visual interword constraints in a text page

Character segmentation is a critical preprocessing step for text recognition. In this paper a method is presented that utilizes visual inter-word constraints available in a text image to split word images into smaller image pieces. This method is applicable to machine-printed texts in which the same spacing is always used between identical pairs of characters. The visual inter-word constraints ...

Algorithms for postprocessing OCR results with visual inter-word constraints

Algorithms are presented that determine the visual relationships between word images in a document. These include instances of common word images and common substrings that occur often in English language text images. This information is then used to improve the performance of a commercial optical character recognition (OCR) algorithm. The algorithms presented here calculate clusters of equi...

Recognition of word collocation habits using frequency rank ratio and inter-term intimacy

An effective algorithm for extracting two useful features from text documents for analyzing word collocation habits, "Frequency Rank Ratio" (FRR) and "Intimacy", is proposed. FRR is deriv...

Publication date: 1994