Multilingual OCR (MOCR): An Approach to Classify Words to Languages

Authors

Mohammad Abu Obaida

Md. Jakir Hossain

Momotaz Begum

Md. Shahin Alam

Abstract

There have been immense efforts to design a complete OCR for most of the world's leading languages; however, multilingual documents, whether handwritten or printed, remain a challenge. As a unified attempt, Unicode-based OCRs have been studied with some positive outcomes, although a large character set slows down recognition significantly. In this paper, we present a method that assigns each word to a language once word segmentation is complete. For this purpose, we identified the writing characteristics of several languages and used a projection method combined with other feature extraction methods. In addition, this paper proposes a modified statistical approach to correct skew before a segmented document is processed. The proposed procedure, evaluated on a collection of both handwritten and printed documents, gave excellent results in assigning words to languages.

General Terms: Pattern Recognition, Document Processing, Optical Character Recognition.
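The abstract does not describe the projection method in detail, so the following is only a generic sketch of projection profiles on a binary word image, not the authors' implementation. The function names, the `ratio` threshold, and the "headline" feature (a nearly solid row of ink near the top of a word, as found in scripts such as Devanagari or Bangla) are all assumptions chosen for illustration.

```python
from typing import List

Image = List[List[int]]  # binary word image: 1 = ink, 0 = background


def horizontal_projection(word: Image) -> List[int]:
    """Count ink pixels in each row of the word image."""
    return [sum(row) for row in word]


def vertical_projection(word: Image) -> List[int]:
    """Count ink pixels in each column of the word image."""
    return [sum(col) for col in zip(*word)]


def has_headline(word: Image, ratio: float = 0.8) -> bool:
    """Hypothetical feature: some row in the upper half is almost fully inked.

    `ratio` is an assumed threshold, not a value taken from the paper.
    """
    width = len(word[0])
    profile = horizontal_projection(word)
    upper = profile[: max(1, len(profile) // 2)]
    return max(upper) >= ratio * width


# Toy images (assumed data): one word with a solid top stroke, one without.
with_headline = [
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
without_headline = [
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
]

print(horizontal_projection(with_headline))
print(has_headline(with_headline), has_headline(without_headline))
```

A real classifier would likely combine several such profile features with others, as the abstract indicates; a single threshold on one row is only meant to show how a projection profile can expose a script-level trait.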

Related resources

The Bible , Truth , and Multilingual OCR

Multilingual OCR has emerged as an important information technology, thanks to the increasing need for cross-language information access. While many research groups and companies have developed OCR algorithms for various languages, it is difficult to compare the performance of these OCR algorithms across languages. This difficulty arises because most evaluation methodologies rely on the use of a do...

The Bible, truth, and multilingual OCR evaluation

Multilingual OCR has emerged as an important information technology, thanks to the increasing need for cross-language information access. While many research groups and companies have developed OCR algorithms for various languages, it is difficult to compare the performance of these OCR algorithms across languages. This difficulty arises because most evaluation methodologies rely on the use of a doc...

Discrimination of English to other Indian languages (Kannada and Hindi) for OCR system

India is a multilingual, multi-script country. Every state of India uses two languages: the local state language and English. For example, in Andhra Pradesh, a state in India, a document may contain text words in English and Telugu script. For Optical Character Recognition (OCR) of such a bilingual document, it is necessary to identify the script before feeding the text w...

Lexicon Reduction for Urdu/Arabic Script Based Character Recognition: A Multilingual OCR

Arabic script character recognition is a challenging task due to the complexity of the script and the huge number of ligatures. We present a method for the development of a multilingual Arabic script OCR (Optical Character Recognition) and lexicon reduction for Arabic script and its derivative languages. The objective of the proposed method is to overcome the large dataset of Urdu and similar scripts by using...

On Separation of English Numerals from Multilingual Document Images

For Optical Character Recognition (OCR) of bilingual or multilingual documents containing text words in a regional language and numerals in English, it is necessary to identify the different script forms before running an individual OCR for each script. In this paper, an attempt is made to separate English numerals at the word level from bilingual and trilingual documents representing Kannada, Devnag...

Publication date: 2011
