Prediction by Categorical Features: Generalization Properties and Application to Feature Ranking

Authors

  • Sivan Sabato
  • Shai Shalev-Shwartz
Abstract

We describe and analyze a new approach for feature ranking in the presence of categorical features with a large number of possible values. It is shown that popular ranking criteria, such as the Gini index and the misclassification error, can be interpreted as the training error of a predictor that is deduced from the training set. It is then argued that using the generalization error is a more adequate ranking criterion. We propose a modification of the Gini index criterion, based on a robust estimation of the generalization error of a predictor associated with the Gini index. The properties of this new estimator are analyzed, showing that for most training sets, it produces an accurate estimation of the true generalization error. We then address the question of finding the optimal predictor that is based on a single categorical feature. It is shown that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties.
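The reading of the Gini index and the misclassification error as training errors can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes binary labels in {0, 1}, a single categorical feature, and the usual interpretation in which the Gini index corresponds to a randomized predictor that outputs label 1 with the empirical frequency of label 1 for each feature value, while the misclassification error corresponds to a majority-vote predictor. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def per_value_counts(xs, ys):
    """Map each feature value to a Counter of the binary labels seen with it."""
    counts = defaultdict(Counter)
    for x, y in zip(xs, ys):
        counts[x][y] += 1
    return counts


def gini_training_error(xs, ys):
    """Expected training error of the randomized predictor that, for feature
    value v, outputs 1 with probability q_v = fraction of label 1 among the
    training examples with value v. This equals sum_v p_v * 2 * q_v * (1 - q_v)."""
    m = len(xs)
    total = 0.0
    for c in per_value_counts(xs, ys).values():
        n = c[0] + c[1]          # number of examples with this feature value
        q = c[1] / n             # empirical frequency of label 1
        total += (n / m) * 2 * q * (1 - q)
    return total


def misclassification_training_error(xs, ys):
    """Training error of the majority-vote predictor: for each feature value,
    the minority-label examples are the ones it gets wrong."""
    m = len(xs)
    return sum(min(c[0], c[1]) for c in per_value_counts(xs, ys).values()) / m


# Toy data (made up for illustration): a feature with two values.
xs = ["a", "a", "a", "a", "b", "b", "b", "b"]
ys = [1, 1, 1, 0, 0, 0, 1, 1]
print(gini_training_error(xs, ys))
print(misclassification_training_error(xs, ys))

# A feature with a distinct value per example gets zero training error under
# both criteria, regardless of how informative it really is.
unique_xs = list(range(len(ys)))
print(gini_training_error(unique_xs, ys))
print(misclassification_training_error(unique_xs, ys))
```

The second example shows why training error can mislead when a feature takes many values: every value is seen once, so both criteria report a perfect score even though the feature may carry no predictive information. This is the situation in which ranking by an estimate of the generalization error becomes attractive.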


Similar resources

Ranking Categorical Features Using Generalization Properties

Feature ranking is a fundamental machine learning task with various applications, including feature selection and decision tree learning. We describe and analyze a new feature ranking method that supports categorical features with a large number of possible values. We show that existing ranking criteria rank a feature according to the training error of a predictor based on the feature. This app...


An Hybrid Approach to Feature Selection for Mixed Categorical and Continuous Data

This paper proposes an algorithm for feature selection in the case of mixed data. It consists in ranking independently the categorical and the continuous features before recombining them according to the accuracy of a classifier. The popular mutual information criterion is used in both ranking procedures. The proposed algorithm thus avoids the use of any similarity measure between samples descr...


Bayesian Support Vector Machines for Feature Ranking and Selection

In this chapter, we develop and evaluate a feature selection algorithm for Bayesian support vector machines. The relevance levels of features are represented by ARD (automatic relevance determination) parameters, which are optimized by maximizing the model evidence in the Bayesian framework. The features are ranked in descending order using the optimal ARD values, and then forward selection is c...


Intelligent application for Heart disease detection using Hybrid Optimization algorithm

Prediction of heart disease is very important because it is one of the causes of death around the world. Moreover, predicting heart disease at an early stage plays a major role in treatment and recovery, and reduces the costs of diagnosis and its side effects. Machine learning algorithms are able to identify an effective pattern for diagnosis and treatment of the disease and ident...


The Application of Numerical Analysis Techniques to Pattern Recognition of Helicopters by Area Method

In this paper, a new method for selecting feature vectors from different viewing angles is introduced to recognize different types of helicopters. A 32-component feature vector, based on characteristics of shape, area, and length, was created to describe a binary two-dimensional image; not only the shape and length features but also the area features were effective and were used. New features vecto...



Publication date: 2007