Weight adjustment schemes for a centroid based classifier ∗
نویسندگان
چکیده
In recent years we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intra-nets. Automatic text categorization, which is the task of assigning text documents to pre-specified classes (topics or themes) of documents, is an important task that can help both in organizing as well as in finding information on these huge resources. Similarity based categorization algorithms such as k-nearest neighbor, generalized instance set and centroid based classification have been shown to be very effective in document categorization. A major drawback of these algorithms is that they use all features when computing the similarities. In many document data sets, only a small number of the total vocabulary may be useful for categorizing documents. A possible approach to overcome this problem is to learn weights for different features (or words in document data sets). In this report we present two fast iterative feature weight adjustment algorithms for the linearcomplexity centroid based classification algorithm. Our algorithms use a measure of the discriminating power of each term to gradually adjust the weights of all features concurrently. We experimentally evaluate our algorithms on the Reuters-21578 and OHSUMED document collections and compare it against Rocchio, Widrow-Hoff and SVM. We also compared its performance in terms of classification accuracy on data sets with multiple classes. On these data sets we compared its performance against traditional classifiers such as k-nn, Naive Bayesian and C4.5. Experiments show that feature weight adjustment improves the performance of the centroid-based classifier by 25% , substantially outperforms Rocchio and Widrow-Hoff and is competitive with SVM. These algorithms also outperform traditional classifiers such as k-nn, naive bayesian and C4.5 on the multi-class text document data sets. ∗This work was supported by NSF CCR-9972519, by Army Research Office contract DA/DAAG55-98-1-0441, by the DOE ASCI program, and by Army High Performance Computing Research Center contract number DAAH04-95-C-0008. Access to computing facilities was provided by AHPCRC, Minnesota Supercomputer Institute. Related papers are available via WWW at URL: http://www.cs.umn.edu/ ̃karypis
منابع مشابه
Weight Adjustment Schemes for a Centroid Based Classifier Weight Adjustment Schemes for a Centroid Based Classifier Weight Adjustment Schemes for a Centroid Based Classifier *
In recent years we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intra-nets. Automatic text categorization, which is the task of assigning text documents to pre-specified classes (topics or themes) of documents, is an important task that can help both in organizing as well as in finding information on t...
متن کاملInverse-Category-Frequency based Supervised Term Weighting Schemes for Text Categorization
Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifier and SVMs. The widely used term weighting scheme in text categorization, i.e., tf.idf, is originated from information retrieval (IR) field. The intuition behind idf for text categorization seems less reasonable than IR. In this paper, we introduce inverse category frequency (icf) int...
متن کاملInverse Category Frequency based supervised term weighting scheme for text categorization
Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifier and SVMs. The widely used term weighting scheme in text categorization, i.e., tf.idf, is originated from information retrieval (IR) field. The intuition behind idf for text categorization seems less reasonable than IR. In this paper, we introduce inverse category frequency (icf) int...
متن کاملAn Effective Approach to Enhance Centroid Classifier for Text Categorization
Centroid Classifier has been shown to be a simple and yet effective method for text categorization. However, it is often plagued with model misfit (or inductive bias) incurred by its assumption. To address this issue, a novel Model Adjustment algorithm was proposed. The basic idea is to make use of some criteria to adjust Centroid Classifier model. In this work, the criteria include training-se...
متن کاملMultiple attribute group decision making with linguistic variables and complete unknown weight information
Interval type-2 fuzzy sets, each of which is characterized by the footprint of uncertainty, are a very useful means to depict the linguistic information in the process of decision making. In this article, we investigate the group decision making problems in which all the linguistic information provided by the decision makers is expressed as interval type-2 fuzzy decision matrices where each of ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2000