Feature Selection for Improving Case-Based Classifiers on High-Dimensional Data Sets
نویسندگان
چکیده
Case-based reasoning (CBR) is a suitable paradigm for class discovery in molecular biology, where the rules that define the domain knowledge are difficult to obtain, and there is not sufficient knowledge for formal knowledge representation. To extend the capabilities of this paradigm, we propose logistic regression for CBR (LR4CBR), a method that uses logistic regression as a feature selection (FS) method for CBR systems. Our method not only improves the prediction accuracy of CBR classifiers in biomedical domains, but also selects a subset of features that have meaningful relationships with their class labels. In this paper, we introduce two methods to rank features for logistic regression. We show that using logistic regression as a filter FS method outperforms other FS techniques, such as Fisher and t-test, which have been widely used in analyzing biological data sets. The FS methods are combined with a computational framework for a CBR system called TA3 . We also evaluate the method on two mass spectrometry data sets, and show that the prediction accuracy of TA3 improves from 90% to 98% and from 79.2% to 95.4%. Finally, we compare our list of discovered biomarkers with the lists of selected biomarkers from other studies for the mass spectrometry data sets, and show the overlapping biomarkers.
منابع مشابه
Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach
Feature selection can significantly be decisive when analyzing high dimensional data, especially with a small number of samples. Feature extraction methods do not have decent performance in these conditions. With small sample sets and high dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...
متن کاملFeature selection using genetic algorithm for breast cancer diagnosis: experiment on three different datasets
Objective(s): This study addresses feature selection for breast cancer diagnosis. The present process uses a wrapper approach using GA-based on feature selection and PS-classifier. The results of experiment show that the proposed model is comparable to the other models on Wisconsin breast cancer datasets. Materials and Methods: To evaluate effectiveness of proposed feature selection method, we ...
متن کاملA hybrid filter-based feature selection method via hesitant fuzzy and rough sets concepts
High dimensional microarray datasets are difficult to classify since they have many features with small number ofinstances and imbalanced distribution of classes. This paper proposes a filter-based feature selection method to improvethe classification performance of microarray datasets by selecting the significant features. Combining the concepts ofrough sets, weighted rough set, fuzzy rough se...
متن کاملAccurate Fault Classification of Transmission Line Using Wavelet Transform and Probabilistic Neural Network
Fault classification in distance protection of transmission lines, with considering the wide variation in the fault operating conditions, has been very challenging task. This paper presents a probabilistic neural network (PNN) and new feature selection technique for fault classification in transmission lines. Initially, wavelet transform is used for feature extraction from half cycle of post-fa...
متن کاملA Feature Subset Selection Method Based On High-Dimensional Mutual Information
Feature selection is an important step in building accurate classifiers and provides better understanding of the data sets. In this paper, we propose a feature subset selection method based on high-dimensional mutual information. We also propose to use the entropy of the class attribute as a criterion to determine the appropriate subset of features when building classifiers. We prove that if th...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2005