A Study of Local and Global Thresholding Techniques in Text Categorization
Abstract
Feature filtering is a widely used approach to dimensionality reduction in text categorization. Feature scoring methods first evaluate each feature, and thresholding is then applied, either locally or globally, to select the highest-scoring features. In this paper, we investigate several local and global feature selection methods and suggest Standard Deviation (STD) and Maximum Deviation (MD) as globalization schemes. We present a comparative study of fourteen thresholding techniques using different scoring methods and benchmark datasets of diverse nature, including an investigation of normalizing feature scores before combining them in the global pool. The results suggest that normalized MD outperforms the other methods in thresholding Document Frequency (DF) scores on even and moderately diverse datasets. Furthermore, the results indicate that normalizing feature scores improves performance on rare categories and offsets the bias of some techniques toward frequent categories.
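The difference between local and global thresholding can be illustrated with a minimal sketch. The abstract does not define STD, MD, or the normalization step, so the definitions below are assumptions: STD is taken as the population standard deviation of a feature's per-category scores, MD as the largest absolute deviation from the feature's mean score, and normalization as per-category min-max scaling. All function names and the toy DF scores are hypothetical.

```python
from statistics import mean, pstdev

def normalize_per_category(scores):
    """Min-max scale each category's scores to [0, 1] (assumed normalization)."""
    n_cats = len(next(iter(scores.values())))
    cols = [[row[c] for row in scores.values()] for c in range(n_cats)]
    low = [min(col) for col in cols]
    span = [(max(col) - min(col)) or 1.0 for col in cols]  # avoid division by zero
    return {f: [(v - low[c]) / span[c] for c, v in enumerate(row)]
            for f, row in scores.items()}

def local_select(scores, k):
    """Local thresholding: keep the top-k features of every category."""
    n_cats = len(next(iter(scores.values())))
    selected = set()
    for c in range(n_cats):
        ranked = sorted(scores, key=lambda f: scores[f][c], reverse=True)
        selected.update(ranked[:k])
    return selected

# Globalization schemes: collapse per-category scores into one score per feature.
# The "std" and "md" definitions are assumptions, not taken from the paper.
GLOBALIZERS = {
    "max": max,
    "std": pstdev,
    "md": lambda row: max(abs(v - mean(row)) for v in row),
}

def global_select(scores, k, scheme="md", normalize=False):
    """Global thresholding: keep the top-k features of the pooled ranking."""
    if normalize:
        scores = normalize_per_category(scores)
    combine = GLOBALIZERS[scheme]
    ranked = sorted(scores, key=lambda f: combine(scores[f]), reverse=True)
    return set(ranked[:k])

# Toy DF scores: [frequent category, rare category] for each feature.
scores = {"ball": [90, 1], "goal": [80, 2], "vote": [5, 9], "law": [4, 8]}

print(local_select(scores, 1))                         # one feature per category
print(global_select(scores, 2, "md"))                  # pooled, raw scores
print(global_select(scores, 2, "md", normalize=True))  # pooled, normalized scores
```

On this toy input, the unnormalized global pool keeps only features from the frequent category, while normalizing first lets a rare-category feature into the selection. This only illustrates the mechanism and is not evidence for the paper's reported results.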
Similar resources
Enhanced the Image Segmentation Process Based on Local and Global Thresholding
Image processing plays an important role in computer vision. Image segmentation partitions an image into segments according to their feature attributes. Region-based segmentation is a type of similarity-based segmentation. Another type of segmentation is thresholding-based segmentation. In thresholding based segmentation method some thresholding technique...
A Novel Degraded Document Image Binarazation by using Local Thresholding Segmentation
The proposed binarization is a scheme of parting a image pixel values into two classes black as foreground and white pixels as background then the thresholding is found for well known scheme for document image binarization. In this proposed work for the decomposition of both global and local thresholding this basic thresholding value we can use further. Here the global thresholding scheme is ef...
A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization
Two main research areas in statistical text categorization are similarity-based learning algorithms and associated thresholding strategies. The combination of these techniques significantly influences the overall performance of text categorization. After investigating two similarity-based classifiers (k-NN and Rocchio) and three common thresholding techniques (RCut, PCut, and SCut), we describe...
An Optimal Approach to Local and Global Text Coherence Evaluation Combining Entity-based, Graph-based and Entropy-based Approaches
Text coherence evaluation becomes a vital and lovely task in Natural Language Processing subfields, such as text summarization, question answering, text generation and machine translation. Existing methods like entity-based and graph-based models are engaging with nouns and noun phrases change role in sequential sentences within short part of a text. They even have limitations in global coheren...
KAN and RinSCut: Lazy Linear Classifier and Rank-in-Score Threshold in Similarity-Based Text Categorization
Two important research areas in statistical approaches for automated text categorization are similarity-based learning algorithms and associated thresholding strategies. The combination of these techniques significantly influences the overall performance of text categorization systems. After researching common techniques in both areas, we describe a lazy linear classifier known as the keyword a...
Publication date: 2006