arabic e text

Transforming Standard Arabic to Colloquial Arabic

2012

Emad Mohamed Behrang Mohit Kemal Oflazer

We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The process holds promise for any NLP task targeting the dialectal varieties of Arabi...
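
As a rough illustration of the out-of-vocabulary figure quoted above, the sketch below computes an OOV percentage for a test set against a training vocabulary. Counting at the token level, the function name `oov_rate`, and the toy transliterated tokens are all assumptions made for illustration; the abstract does not say how the authors computed their numbers.

```python
def oov_rate(train_tokens, test_tokens):
    """Percentage of test tokens whose surface form never occurs in training.

    Assumption: OOV is counted per token (not per unique word type).
    """
    vocabulary = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocabulary)
    return 100.0 * unseen / len(test_tokens)


# Hypothetical toy data, not taken from the paper.
train = ["ana", "uridu", "an", "adhhaba"]
test = ["ana", "ayiz", "aruh", "an"]
print(oov_rate(train, test))  # 2 of the 4 test tokens are unseen
```

Adding generated colloquial forms to the training side enlarges `vocabulary`, which is the mechanism by which such a rate could fall.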

Unsupervised Stemmer for Arabic Tweets

2016

Fahad Albogamy Allan Ramsay

Stemming is an essential processing step in a wide range of high-level text processing applications such as information extraction, machine translation and sentiment analysis. It is used to reduce words to their stems. Many stemming algorithms have been developed for Modern Standard Arabic (MSA). Although Arabic tweets and MSA are closely related and share many characteristics, there are substa...

On the Optical Character Recognition and Machine Translation Technology in Arabic: Problems and Solutions

2011

Oleg Redkin Olga Bernikova

The report addresses the basic problems of formalizing the Arabic language, based on an analysis of linguistic errors in software products. Reviewing the principles by which modern information systems operate, the authors conclude that the existing methods of Arabic formalization show a shift towards the technological aspects of the linguistic processing of facts, however, t...

An Enhanced Arabic OCR Degraded Text Retrieval Model

2013

Mostafa Ezzat Tarek Elghazaly Mervat Gheith

This paper provides a new model that enhances the retrieval effectiveness of Arabic OCR-degraded text. The proposed model is based on simulating Arabic OCR recognition mistakes with a word-based approach. The model then expands the user's search query using the expected OCR errors. The resulting expanded search query gives higher precision and recall in searching Arabic OCR-degraded text rather than the ...
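
The query-expansion idea can be sketched as follows, assuming a word-level lookup table of expected OCR-degraded forms is already available. The table `OCR_VARIANTS`, its entries, and the function `expand_query` are hypothetical; the abstract does not describe how the authors build or apply their error model.

```python
# Hypothetical table: correct word -> forms an OCR engine might produce instead.
OCR_VARIANTS = {
    "كتاب": ["كناب", "كتات"],
}


def expand_query(query, variants=OCR_VARIANTS):
    """Turn each whitespace-separated term into a group of alternatives.

    Each group holds the original term first, followed by any expected
    OCR-degraded forms; a search engine would match any member of a group.
    """
    expanded = []
    for term in query.split():
        group = [term]
        for variant in variants.get(term, []):
            if variant not in group:
                group.append(variant)
        expanded.append(group)
    return expanded


groups = expand_query("كتاب جديد")
print(groups)
```

Terms with no entry in the table pass through unchanged, so the expanded query never matches fewer documents than the original one.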

ICDAR2015 Writer Identification Competition using KHATT, AHTID/MW and IBHC Databases

2015

Chayan Halder

Handwriting is considered to be one of the commonly used modalities to identify persons in commercial, governmental and forensic applications. In order to record recent advances in the field of writer identification, we are proposing to organize the ICDAR2015 writer identification competition using KHATT, AHTID/MW and IBHC Databases. A first edition of the Arabic Writer Identification Competition...

Learning for transliteration of arabic-numeral expressions using decision tree for Korean TTS

2004

Youngim Jung Donghun Lee HyeonSook Nam Ae-sun Yoon Hyuk-Chul Kwon

Despite much work on TTS technologies and several TTS systems customized for Korean, current TTS systems output many errors in transliterating non-alphabetic symbols such as Arabic numerals and text symbols. This paper proposes TLAN (Transliteration Learner for Arabic-Numeral expressions) which can efficiently disambiguate the reading and meaning of Arabic Numeral Expressions (ANEs) in texts...

A Comparative Study on Arabic Text Classification

Journal: Egyptian Computer Science Journal 2008

Alaa El-Halees

This paper focuses on automatic Arabic text classification. Arabic is a highly inflectional and derivational language, which makes text mining a complex task. There are many published experimental results on classifying Arabic text. Since these results came from different datasets, authors and evaluation metrics, we cannot compare the performance of the experimented classifiers. In this pape...

Clitics in Arabic Language: A Statistical Study

2010

Fahad Alotaiby Salah Foda Ibrahim Alkharashi

Clitics in the Arabic language can be attached to a stem or to each other without orthographic marks such as an apostrophe. In this paper we present a statistical study of clitics and their effect in the Arabic language. We tokenize a large Arabic text using white-spaces and an automatic clitics tokenizer (AMIRA 2.0) and compare the unique-word count in both cases with the English language. We also show the re...

Topic Modeling of Phonetic Latin-Spelled Arabic for the Relative Analysis of Genre-Dependent and Dialect-Dependent Variation

2012

Ali Sakr Mark Hasegawa-Johnson

We demonstrate a data collection and analysis system that can be used to analyze the relative contributions of dialect-dependent variation in the lexicon of speech-like Arabic text. We utilize Latent Dirichlet Allocation (LDA), a generative probabilistic modeling method, to analyze a phonetic Latin-spelled Arabic online chat corpus. The corpus produces different word choices and word relations ...

English/Arabic Cross Language Information Retrieval (CLIR) for Arabic OCR-Degraded Text

2009

Tarek A. Elghazaly

In this paper, a novel model for query translation and expansion, enabling English/Arabic CLIR for both normal and OCR-degraded Arabic text, has been proposed, implemented, and tested. First, an English/Arabic Word Collocations Dictionary has been established, and three English/Arabic Single Words Dictionaries have been reproduced. Second, a modern Arabic Corpus has been built. Third, a model for simu...
