K-norm Misclassification Rate Estimation for Decision Trees
Authors
Abstract
The decision tree classifier is a well-known methodology for classification. It is widely accepted that a fully grown tree is usually over-fit to the training data and should therefore be pruned back. In this paper, we analyze the overtraining issue theoretically using a k-norm risk estimation approach with Lidstone's estimate. Our analysis allows a deeper understanding of decision tree classifiers, especially of how their misclassification rates can be estimated using our equations. We propose a simple pruning algorithm based on our analysis and prove its superior properties, including its independence from validation data and its efficiency.
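The paper's k-norm estimator itself is not reproduced here, but the Lidstone (additive) smoothing it builds on is easy to illustrate. The sketch below, with a hypothetical `alpha` smoothing parameter not taken from the paper, shows how smoothing keeps a small, pure leaf from receiving a zero misclassification-rate estimate:

```python
def lidstone_error_estimate(errors: int, n: int,
                            num_classes: int = 2,
                            alpha: float = 1.0) -> float:
    """Lidstone-smoothed estimate of a leaf's misclassification rate.

    With alpha = 1 this reduces to the Laplace correction; the function
    name and default values are illustrative, not from the paper.
    """
    return (errors + alpha) / (n + alpha * num_classes)

# A leaf that makes 0 errors on only 3 training samples is not
# estimated to be a perfect classifier:
rate = lidstone_error_estimate(0, 3, num_classes=2, alpha=1.0)  # → 0.2
```

The smoothed estimate shrinks toward the uniform rate `1/num_classes`, with the pull strongest for small leaves, which is exactly where a fully grown tree over-fits.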
Similar sources
ipred: Improved Predictors
In classification problems, there are several approaches to creating rules that assign future observations to certain classes. Common methods include, for example, linear discriminant analysis and classification trees. Recent developments have led to substantial reductions of the misclassification error in many applications. Bootstrap aggregation ("bagging", Breiman, 1996a) combines classifiers trained on bootstr...
Improved Class Probability Estimates from Decision Tree Models
Decision tree models typically give good classification decisions but poor probability estimates. In many applications, it is important to have good probability estimates as well. This paper introduces a new algorithm, Bagged Lazy Option Trees (B-LOTs), for constructing decision trees and compares it to an alternative, Bagged Probability Estimation Trees (B-PETs). The quality of the class proba...
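The smoothing commonly used in probability estimation trees (the B-PET baseline this abstract compares against) is the Laplace correction of leaf frequencies. A minimal sketch, with an illustrative function name not taken from the paper:

```python
from collections import Counter

def laplace_class_probs(leaf_labels, classes):
    """Laplace-corrected class probabilities at a decision-tree leaf.

    Raw relative frequencies in a small, pure leaf give extreme
    probabilities (0 or 1); adding one pseudo-count per class pulls the
    estimates toward uniform, which typically improves probability
    quality without changing most classification decisions.
    """
    counts = Counter(leaf_labels)
    n, c = len(leaf_labels), len(classes)
    return {cls: (counts.get(cls, 0) + 1) / (n + c) for cls in classes}

# A pure leaf of 3 positives: raw frequency would give P(neg) = 0,
# the Laplace correction gives P(neg) = 1/5.
probs = laplace_class_probs(["pos", "pos", "pos"], ["pos", "neg"])
```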
The Comparison of Gini and Twoing Algorithms in Terms of Predictive Ability and Misclassification Cost in Data Mining: An Empirical Study
The classification tree is commonly used in data mining, particularly for investigating interactions among predictors. The splitting rule and the decision tree technique employ algorithms that are largely based on statistical and probability methods. The splitting procedure is the most important phase of classification tree training. The aim of this study is to compare the Gini and Twoing splitting rul...
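The two criteria being compared can be sketched directly from their CART definitions; the helper names below are illustrative, not from the study:

```python
def gini(labels):
    """Gini impurity of a node: 1 - sum_i p_i**2."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def twoing(left, right):
    """Twoing criterion for a binary split (Breiman et al., CART):
    (p_L * p_R / 4) * (sum_i |p(i|left) - p(i|right)|) ** 2,
    where p_L, p_R are the fractions of samples sent to each child.
    """
    n = len(left) + len(right)
    p_l, p_r = len(left) / n, len(right) / n
    classes = set(left) | set(right)
    diff = sum(abs(left.count(c) / len(left) - right.count(c) / len(right))
               for c in classes)
    return (p_l * p_r / 4.0) * diff ** 2

# A perfectly separating split of [0, 0, 1, 1]:
parent_impurity = gini([0, 0, 1, 1])        # → 0.5
twoing_score = twoing([0, 0], [1, 1])       # → 0.25
```

Gini rewards splits that reduce node impurity, while Twoing rewards splits that send different class profiles to the two children; for two classes the rankings often coincide, and the empirical question is how they differ in predictive ability and misclassification cost.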
Measuring unsupervised acoustic clustering through phoneme pair merge-and-split tests
Subphonetic discovery through segmental clustering is a central step in building a corpus-based synthesizer. To help decide which clustering algorithm to use, we employed merge-and-split tests on English fricatives. Compared to a reference rate of 2%, Gaussian EM achieved a misclassification rate of 6% and K-means 10%, while predictive CART trees performed poorly.
Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications
What are the natural loss functions or fitting criteria for binary class probability estimation? This question has a simple answer: so-called “proper scoring rules”, that is, functions that score probability estimates in view of data in a Fisher-consistent manner. Proper scoring rules comprise most loss functions currently in use: log-loss, squared error loss, boosting loss, and as limiting cas...
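The defining property of a proper scoring rule is that the expected loss is minimized when the forecast equals the true probability. A minimal sketch for a binary outcome, with log-loss and squared-error (Brier) loss as the two examples named in the abstract; the helper names are illustrative:

```python
import math

def log_loss(p, y):
    """Log-loss for one binary outcome y in {0, 1} given forecast p."""
    return -math.log(p if y == 1 else 1.0 - p)

def brier(p, y):
    """Squared-error (Brier) loss for one binary outcome."""
    return (p - y) ** 2

def expected_log_loss(q, p_true=0.7):
    """Expected log-loss of forecast q when the true probability is p_true."""
    return p_true * log_loss(q, 1) + (1.0 - p_true) * log_loss(q, 0)

# Propriety: forecasting the true probability (0.7) beats any other forecast.
assert expected_log_loss(0.7) < expected_log_loss(0.5)
assert expected_log_loss(0.7) < expected_log_loss(0.9)
```

The same check with `brier` in place of `log_loss` also passes, since squared-error loss is likewise proper.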
Publication date: 2007