Degraded Text Recognition Using Word Collocation and Visual Inter-Word Constraints
Abstract
Given a noisy text page, a word recognizer can generate a set of candidates for each word image. A relaxation algorithm previously proposed by the authors uses word collocation statistics to select, for each word image, the candidate with the highest probability of being correct. Because word collocation is a local constraint and collocation data trained from corpora are usually incomplete, the algorithm cannot select the correct candidate for some images. To overcome this limitation, contextual information at the image level is now exploited inside the relaxation algorithm. If two word images match each other, they should have the same symbolic identity. Visual inter-word relations provide a way to link word images in the text and to interpret them systematically. Integrating visual inter-word constraints with word collocation data improves the performance of the relaxation algorithm.

Introduction

Word collocation is one source of information that has been proposed as a useful tool for post-processing word recognition results ([1, 4]). It can be treated as a constraint on candidate selection, so that the word candidate selection problem can be formalized as an instance of constraint satisfaction. Relaxation is a typical method for constraint satisfaction problems. One of its advantages is that it can achieve a global effect using only local constraints.

Previously, a probabilistic relaxation algorithm was proposed for word candidate re-evaluation and selection ([2]). The basic idea of the algorithm is to use word collocation constraints to select the word candidates that have a high probability of occurring together with word candidates at other nearby locations. The algorithm runs iteratively. In each iteration, the probability of each word candidate is updated based on its previous probability, the probabilities of its neighbors, and word collocation data.
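The iterative update just described can be illustrated with a minimal Python sketch. The text does not give the exact update formula, so the multiplicative rule below, the immediate-neighbor window, and the names `relax`, `pair_strength`, `floor`, `max_iter`, and `tol` are all assumptions made for illustration; this is not the authors' published algorithm.

```python
from typing import Dict, List, Tuple

Collocation = Dict[Tuple[str, str], float]


def pair_strength(colloc: Collocation, a: str, b: str, floor: float) -> float:
    """Look up a collocation score in either order; unseen pairs get `floor`."""
    return colloc.get((a, b), colloc.get((b, a), floor))


def relax(cands: List[Dict[str, float]], colloc: Collocation,
          floor: float = 0.01, max_iter: int = 50,
          tol: float = 1e-6) -> List[Dict[str, float]]:
    """Assumed relaxation-style update, not the published formula.

    cands[i] maps each candidate word of image i to its initial probability.
    Each iteration rescales a candidate by the expected collocation strength
    with the candidates of its immediate neighbors, then renormalizes.
    """
    probs = [dict(c) for c in cands]  # copy so the input is not mutated
    for _ in range(max_iter):
        new_probs = []
        delta = 0.0
        for i, cur in enumerate(probs):
            scores = {}
            for word, p in cur.items():
                support = 1.0
                for j in (i - 1, i + 1):  # window of one word on each side
                    if 0 <= j < len(probs):
                        support *= sum(
                            q * pair_strength(colloc, word, other, floor)
                            for other, q in probs[j].items()
                        )
                scores[word] = p * support
            total = sum(scores.values())
            normalized = {w: s / total for w, s in scores.items()}
            delta = max(delta, max(abs(normalized[w] - cur[w]) for w in cur))
            new_probs.append(normalized)
        probs = new_probs
        if delta < tol:  # stop once every probability is stable
            break
    return probs


# Toy input: the recognizer slightly prefers "farm" after "application".
cands = [{"application": 1.0}, {"farm": 0.6, "form": 0.4}]
colloc = {("application", "form"): 0.9, ("application", "farm"): 0.05}
result = relax(cands, colloc)
decisions = [max(p, key=p.get) for p in result]
```

In this toy input the collocation scores are invented; they only show how a strong collocation can overturn a recognizer's initial preference.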
The initial probability of each word candidate is provided by a word recognizer. The relaxation process terminates when the probability of each word candidate becomes stable. After relaxation finishes, the word candidate with the highest probability is selected as the decision for each word image.

Because the window size of word collocation is usually small, word collocation is a local constraint. Because word collocation data are derived from text corpora, they are usually incomplete and unbalanced. These properties limit the usefulness of word collocation for candidate selection. An analysis of the algorithm's performance identified three sources of error: (1) the local context cannot provide enough information to distinguish between competing candidates; (2) word collocation data trained from corpora are incomplete and therefore lack the statistics needed to select the correct candidate; and (3) word collocation data trained from unbalanced corpora are biased, so the wrong candidate is selected.

In normal English text, the same word often occurs many times. Because the main body of a text is usually set in a single font, different occurrences of the same word are visually similar even if the text image is highly degraded. Visual similarity between word images can place useful constraints on the process of candidate selection ([3]). If two word images match each other, their identities should be the same. For example, consider the two sentences "Please fill in the application X" and "This Y is almost the same as that one", where X and Y are visually similar and both have the candidate set {farm, form}. The candidate "form" can easily be selected as the decision for both X and Y when word collocation and visual inter-word constraints are considered together, although it is difficult to select a candidate for Y using word collocation alone.
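The effect of the visual constraint in the "farm"/"form" example can be sketched as follows. The paper applies visual inter-word constraints inside the relaxation algorithm; the simpler post-hoc pooling below (multiplying candidate probabilities across matched images) is an assumption made only to illustrate the idea, and the names `cluster_matches` and `pool_decisions` as well as the probabilities are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cluster_matches(n: int, matches: List[Tuple[int, int]]) -> List[List[int]]:
    """Group image indices into clusters of mutually matching images (union-find)."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return list(groups.values())


def pool_decisions(probs: List[Dict[str, float]],
                   matches: List[Tuple[int, int]]) -> List[str]:
    """Pick one shared word per cluster of visually matching images.

    Assumed pooling rule: multiply the probabilities each image assigns to a
    candidate that is common to every image in the cluster.
    """
    decisions = [""] * len(probs)
    for group in cluster_matches(len(probs), matches):
        shared = set(probs[group[0]])
        for i in group[1:]:
            shared &= set(probs[i])
        if not shared:  # no common candidate: decide each image on its own
            for i in group:
                decisions[i] = max(probs[i], key=probs[i].get)
            continue
        pooled = {w: 1.0 for w in shared}
        for i in group:
            for w in shared:
                pooled[w] *= probs[i][w]
        best = max(pooled, key=pooled.get)
        for i in group:
            decisions[i] = best
    return decisions


# Image 0 is X (after "application"); image 1 is Y (weak local context).
probs = [{"farm": 0.1, "form": 0.9}, {"farm": 0.55, "form": 0.45}]
without_visual = pool_decisions(probs, [])
with_visual = pool_decisions(probs, [(0, 1)])
```

Here `without_visual` decides each image independently, so Y keeps the recognizer's marginal preference, while `with_visual` lets the strong evidence for X carry over to the matching image Y.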
Modified Relaxation Algorithm

Figure 1 is the description of the new relaxation algorithm that integrates word collocation and visual inter-word constraints for candidate selection. Given a sequence of word images from a text page, the first step of
Publication date: 1994