On the Optimality of Probability Estimation by Random Decision Trees
Abstract
Random decision tree is an ensemble of decision trees. The feature at any node of a tree in the ensemble is chosen randomly from remaining features. A chosen discrete feature on a decision path cannot be chosen again. Continuous feature can be chosen multiple times, however, with a different splitting value each time. During classification, each tree outputs raw posterior probability. The probabilities from each tree in the ensemble are averaged as the final posterior probability estimate. Although remarkably simple and somehow counter-intuitive, random decision tree has been shown to be highly accurate under 0-1 loss and cost-sensitive loss functions. Preliminary explanation of its high accuracy is due to the “error-tolerance” property of probabilistic decision making. Our study has shown that the actual reason for random tree’s superior performance is due to its optimal approximation to each example’s true probability to be a member of a given class. Introduction Given an unknown target function y = F (x) and a set of examples of this target function {(x, y)}, a classification algorithm constructs an inductive model that approximates the unknown target function. Each example x is a feature vector of discrete and continuous values such as age, income, education, and salary. y is drawn from a discrete set of values such as {fraud, nonfraud}. A classification tree or decision tree is a directed single-rooted acyclic graph (sr-DGA) ordered feature tests. Each internal node of a decision tree is a feature test. Prediction is made at leaf nodes. Decision trees classify examples by sorting them down the tree from the root to some leaf node. Each non-leaf node in the tree specifies a test of some feature of that example. For symbolic or discrete features, each branch descending from the node specifies to one of the possible values of this feature. For continuous values, one branch corresponds to instances with feature value ≥ the threshold and another one < the threshold. Different instances are classified by different paths starting at the root of the tree and ending at a leaf. Some instances, e.g., with missing attribute values etc., may be split among multiple paths. w is the weight of an Copyright c © 2004, American Association for Artificial Intelligence (www.aaai.org). All rights reserved. instance x; it is set to 1.0 initially or other real numbers proportional to the probability that x is sampled. When x splits among multiple paths, the weight is split among different paths usually proportional to the probability of that path. If x has a missing value at an attribute test and the attribute has value ‘A’ with probability 0.9 and value ‘B’ with probability 0.1 in the training data, x will be classified by both path ‘A’and path ‘B’ with weights 0.9w and 0.1w respectively. Since every path is unique and every possible split is disjoint, the sum of all weights at every leaf node is the sum of the weight of every instance. A leaf is a collection of examples that may not be classified any further. Ideally, they may all have one single class, in which case, there is no utility for further classification. In many cases, they may still have different class labels. They may not be classified any further because either additional feature tests cannot classify better or the number of examples are so small that fails a given statistical significance test. In these cases, the prediction at this leaf node is the majority class or the class label with the most number of occurrences. 
Since each path from the root to a leaf is unique, a decision tree shatters the instance space into multiple leaves. The performance of a decision tree is measured by some "loss function" designed for the application at hand. Given a loss function L(t, y), where t is the true label and y is the predicted label, an optimal decision tree is one that minimizes the average loss L(t, y) over all examples, weighted by their probability. Typical loss functions in data mining are 0-1 loss and cost-sensitive loss. For 0-1 loss, L(t, y) = 0 if t = y, and L(t, y) = 1 otherwise. For cost-sensitive loss, L(t, y) = c(x, t) if t = y, and L(t, y) = w(x, y, t) otherwise. In general, when an example is correctly predicted, L(t, y) depends only on x and its true label t; when it is misclassified, L(t, y) depends on the example as well as its true label and the prediction. If the problem is not ill-defined, we expect c(x, t) ≤ w(x, y, t). For many problems, t is nondeterministic, i.e., if x is sampled repeatedly, different values of t may be given. This can have various causes, such as noise in the data, an inadequate feature set, insufficient feature precision, or the stochastic nature of the problem (i.e., for the same x, F(x) returns different values at different times). It is usually difficult to know or measure a priori whether a problem is deterministic; without prior knowledge, it is hard to distinguish noise from the stochastic nature of the problem. The optimal decision y* for x is the label that minimizes the expected loss E_t(L(t, y*)) when x is sampled repeatedly and different t's may be given. For the 0-1 loss function, the optimal prediction is the most likely label, i.e., the label that appears most often when x is sampled repeatedly. For cost-sensitive loss, the optimal prediction is the one that minimizes the empirical risk. To choose the optimal decision, the a posteriori probability is usually required. In a decision tree, assume that n_c is the number or weight of examples with class label c at a leaf node and n is the total number or weight of examples at that leaf. The raw a posteriori probability can then be estimated as P(c|x) = n_c / n.
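Continuing the sketch above, the raw estimate at a leaf is the class weight divided by the total weight at that leaf, and the loss-minimizing decision is the label with the smallest expected loss under the estimated probabilities; under 0-1 loss this reduces to the most likely label. The following is a minimal, self-contained sketch under assumed names (leaf_probabilities, optimal_prediction, zero_one_loss are illustrative, not from the paper).

def leaf_probabilities(class_counts):
    """Raw a posteriori estimate at a leaf: P(c|x) = n_c / n, where class_counts
    maps each class label c to the weight n_c of its examples at the leaf."""
    n = sum(class_counts.values())
    return {c: n_c / n for c, n_c in class_counts.items()}

def optimal_prediction(probs, loss):
    """Label y minimizing the expected loss sum_t P(t|x) * L(t, y), where probs
    maps each candidate label to its estimated probability."""
    return min(probs, key=lambda y: sum(p * loss(t, y) for t, p in probs.items()))

def zero_one_loss(t, y):
    # Under 0-1 loss the expected loss of predicting y is 1 - P(y|x),
    # so the optimal prediction reduces to the most likely label.
    return 0.0 if t == y else 1.0

# Example: a leaf holding weighted examples of two classes.
probs = leaf_probabilities({"fraud": 7.0, "nonfraud": 3.0})   # {'fraud': 0.7, 'nonfraud': 0.3}
y_star = optimal_prediction(probs, zero_one_loss)             # 'fraud'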
Related articles
On the Minimax Optimality of Block Thresholded Wavelets Estimators for ?-Mixing Process
We propose a wavelet-based regression function estimator for the estimation of the regression function for a sequence of ?-mixing random variables with a common one-dimensional probability density function. Some asymptotic properties of the proposed estimator based on block thresholding are investigated. It is found that the estimators achieve optimal minimax convergence rates over large class...
A Study on the Accuracy and Precision of Estimation of the Number, Basal Area and Standing Trees Volume per Hectare Using of some Sampling Methods in Forests of NavAsalem
The present study aimed to investigate the accuracy and precision of estimation of the number, basal area, and volume of standing trees by random and systematic random sampling methods in the forests of West Guilan. The cost or inventory time was determined using the criterion (E%2 × T). Inventory was carried out by complete sampling (census) in an area of 52 hectares. The study area (sect...
On Efficiency Criteria in Density Estimation
We discuss the classical efficiency criteria in density estimation and propose some variants. The context is a general density estimation scheme that contains the cases of i.i.d. or dependent random variables, in discrete or continuous time. Unbiased estimation, optimality and asymptotic optimality are considered. An example of a density estimator that satisfies some suggested criteria is given...
The eccentric connectivity index of bucket recursive trees
If $G$ is a connected graph with vertex set $V$, then the eccentric connectivity index of $G$, $\xi^c(G)$, is defined as $\sum_{v\in V(G)}\deg(v)\,\mathrm{ecc}(v)$, where $\deg(v)$ is the degree of a vertex $v$ and $\mathrm{ecc}(v)$ is its eccentricity. In this paper we show some convergence in probability and an asymptotic normality based on this index in random bucket recursive trees.
Wavelet Based Estimation of the Derivatives of a Density for m-Dependent Random Variables
Here, we propose a method of estimation of the derivatives of a probability density based on wavelet methods for a sequence of m-dependent random variables with a common one-dimensional probability density function, and obtain an upper bound on Lp-losses for such estimators.
Ensembles of Probability Estimation Trees for Customer Churn Prediction
Customer churn prediction is one of the most important elements of any Customer Relationship Management (CRM) strategy. In this study, a number of strategies are investigated to increase the lift of ensemble classification models. In order to increase lift performance, two elements of a number of well-known ensemble strategies are altered: (i) the potential of using probability estimation trees...
Publication date: 2004