Linguistic Dumpster Diving: Geographical Classification of Arabic Text

Authors

  • Ron Zacharski
  • Ahmed Abdelali
  • Stephen Helmreich
  • Jim Cowie
  • Mary Washington
Abstract

In many applied natural language processing tasks, information is thrown out. For example, speech recognition systems commonly discard prosodic information; information retrieval systems commonly treat a document as an unordered bag of words and throw out syntactic information; and machine translation systems commonly discard pragmatic information (e.g., topic-comment structure and the referents of anaphoric expressions). Perhaps the most commonly discarded linguistic forms are the frequent words of a language, such as those shown in figure 1.

Similar resources

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Beyond its own merits, text classification (TC) has become a cornerstone of many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

Document Analysis And Classification Based On Passing Window

In this paper we present a document analysis and classification system that segments and classifies the contents of Arabic document images. The system includes preprocessing, document segmentation, feature extraction and document classification. In preprocessing, a document image is enhanced by removing noise, binarizing, and detecting and correcting image skew. In document segmentation, an algorith...

High capacity steganography tool for Arabic text using 'Kashida'

Steganography is the ability to hide secret information in cover media such as sound, pictures and text. A new approach is proposed to hide a secret in Arabic text cover media using "Kashida", an Arabic extension character. The proposed approach is an attempt to maximize the use of "Kashida" to hide more information in Arabic text cover media. To approach this, some algorithms have been des...

Arabic Text Classification Algorithm using TFIDF and Chi Square Measurements

Text categorization is the process of classifying documents into a predefined set of categories based on their keyword content. Text classification is an extended type of text categorization in which the text is further categorized into sub-categories. Many algorithms have been proposed and implemented to solve the problem of English text categorization and classification. However, few studies ...

Comparative Assessment of the Performance of Three WEKA Text Classifiers Applied to Arabic Text

This research compares the performance of three well-known text classification techniques, namely the Support Vector Machine (SVM) classifier, the Naïve Bayes (NB) classifier, and the C4.5 classifier. Text classification aims to automatically assign a text to a predefined category based on its linguistic features and content. These three techniques are compared using a set of Arabic text ...

Publication date: 2008