Categorization of Large Text Collections: Feature Selection for Training Neural Networks

Authors

  • Pensiri Manomaisupat
  • Bogdan Vrusias
  • Khurshid Ahmad
Abstract

Automatic text categorization requires the construction of appropriate surrogates for documents within a text collection. The surrogates, often called document vectors, are used to train learning systems for categorising unseen documents. A comparison of different measures (tfidf and weirdness) for creating document vectors is presented together with two different state-of-the-art classifiers: supervised Kohonen’s SOFM and unsupervised Vapnik’s SVM. The methods are tested using two ‘gold standard’ document collections and one data set from a ‘real-world’ news stream. There appears to be an optimal size both for the number of document vectors and for the dimensionality of each vector that gives the best compromise between categorization accuracy and training time. The performance of each of the classifiers was computed for five different surrogate vector models: the first two surrogates were created with the tfidf and weirdness measures respectively, the third surrogate was created purely on the basis of high-frequency words in the training corpus, and the fourth vector model was created from a standardised terminology database. Finally, the fifth surrogate (used for evaluation purposes) was based on a random selection of words from the training corpus.
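
The two term-scoring measures named in the abstract can be illustrated with a minimal sketch. It assumes the common tf × log(N/df) form of tfidf and the usual definition of weirdness as the ratio of a term's relative frequency in the specialist corpus to its relative frequency in a general reference corpus. The function names, the add-one smoothing of the general-corpus count, and the raw-count document vector are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: tf * idf} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return scores


def weirdness(special_tokens, general_tokens):
    """Relative frequency in the special corpus over that in a general corpus."""
    fs, fg = Counter(special_tokens), Counter(general_tokens)
    ns, ng = len(special_tokens), len(general_tokens)
    # Assumption: add-one smoothing so terms absent from the general corpus
    # do not cause a division by zero (general corpus must be non-empty).
    return {t: (c / ns) / ((fg[t] + 1) / ng) for t, c in fs.items()}


def top_k_terms(score, k):
    """Pick the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


def document_vector(doc, vocabulary):
    """Represent a tokenised document as raw counts over the chosen terms."""
    counts = Counter(doc)
    return [counts[t] for t in vocabulary]
```

In this sketch the dimensionality of each document vector is simply the `k` passed to `top_k_terms`, which is the quantity the abstract reports trading off against training time.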

Similar resources

Introducing a method for extracting features from facial images based on applying transformations to features obtained from convolutional neural networks

In pattern recognition, features denote measurable characteristics of an observed phenomenon, and feature extraction is the procedure of measuring these characteristics. A set of features can be expressed as a feature vector, which is used as the input data of a system. An efficient feature extraction method can improve the performance of a machine learning system such as face recognit...

NTC (Neural Text Categorizer): Neural Network for Text Categorization

This research proposes a new neural network for text categorization which uses alternative representations of documents to numerical vectors. Since the proposed neural network was originally intended only for text categorization, it is called NTC (Neural Text Categorizer) in this research. Numerical vectors representing documents for text mining tasks inherently have two main problems: huge d...

Effective Feature Selection for Pre-Cancerous Cervix Lesions Using Artificial Neural Networks

Since the most common form of cervical cancer starts with pre-cancerous changes, flawless detection of these changes becomes an important issue in preventing and treating cervical cancer. There are two ways to stop this disease from developing. One way is to find and treat pre-cancers before they become true cancers, and the other is to prevent the pre-cancers in the first place. The presented approach...

Sensitivity based Generalization Error for Supervised Learning Problem with Applications in Model Selection and Feature Selection

A generalization error model provides theoretical support for a classifier's performance in terms of prediction accuracy. However, existing models give very loose error bounds. This explains why classification systems generally rely on experimental validation for their claims on prediction accuracy. In this talk we will revisit this problem and explore the idea of developing a new generalizatio...

Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in the amount of information, tools and methods are needed to search, filter, and manage resources. One of the major problems in text classification relates to high-dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...

Publication date: 2006