Statistical Decision and Learning Theory

Author

  • Robert Nowak
Abstract

This paper reviews and contrasts the basic elements of statistical decision theory [1–4] and statistical learning theory [5–7]. It is not intended to be a comprehensive treatment of either subject, but rather just enough to draw comparisons between the two. Throughout this paper, let $X$ denote the input to a decision-making process and $Y$ denote the correct response or output (e.g., the value of a parameter, the label of a class, the signal of interest). We assume that $X$ and $Y$ are random variables or random vectors with joint distribution $P_{X,Y}(x,y)$, where $x$ and $y$ denote specific values that may be taken by the random variables $X$ and $Y$, respectively. The observation $X$ is used to make decisions pertaining to the quantity of interest. For the purposes of illustration, we will focus on the task of determining the value of the quantity of interest. A decision rule for this task is a function $f$ that takes the observation $X$ as input and outputs a prediction of the quantity $Y$. We denote a decision rule by $\hat{Y}$, or by $f(X)$ when we wish to indicate explicitly the dependence of the decision rule on the observation. We will examine techniques for designing decision rules and for analyzing their performance.

0.1 Measuring Decision Accuracy: Loss and Risk Functions

The accuracy of a decision is measured with a loss function. For example, if our goal is to determine the value of $Y$, then a loss function takes as inputs the true value $Y$ and the predicted value (the decision) $\hat{Y} = f(X)$ and outputs a non-negative real number (the "loss") reflective of the accuracy of the decision. Two of the most commonly encountered loss functions are:

1. 0/1 loss: $\ell_{0/1}(\hat{Y}, Y) = \mathbf{1}_{\{\hat{Y} \neq Y\}}$, the indicator function taking the value 1 when $\hat{Y} \neq Y$ and the value 0 when $\hat{Y} = Y$.

2. squared error loss: $\ell_2(\hat{Y}, Y) = \|\hat{Y} - Y\|^2$, which is simply the sum of squared differences between the elements of $\hat{Y}$ and $Y$.

The 0/1 loss is commonly used in detection and classification problems, and the squared error loss is more appropriate for problems involving the estimation of a continuous parameter. Note that since the inputs to the loss function may be random variables, so is the loss.

A risk $R(f)$ is a function of the decision rule $f$, defined to be the expectation of a loss with respect to the joint distribution $P_{X,Y}(x,y)$. For example, the expected 0/1 loss produces the probability-of-error risk function; i.e., a simple calculation shows that $R_{0/1}(f) = E[\mathbf{1}_{\{f(X) \neq Y\}}] = \Pr(f(X) \neq Y)$. The expected squared error loss produces the mean squared error (MSE) risk function, $R_2(f) = E[\|f(X) - Y\|^2]$. Optimal decisions are obtained by choosing a decision rule $f$ that minimizes the desired risk function. Given complete knowledge of the probability distributions involved (e.g., $P_{X,Y}(x,y)$), one can explicitly or numerically design an optimal decision rule, denoted $f^*$, that minimizes the risk function.
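As a concrete illustration of these definitions (a minimal sketch, not taken from the paper), suppose we assume the toy joint distribution $Y \sim \mathrm{Bernoulli}(1/2)$ with $X \mid Y = y \sim N(y, 1)$. Sample averages of the two losses over many draws from $P_{X,Y}$ then approximate the risks $R_{0/1}(f)$ and $R_2(f)$ for any candidate decision rule $f$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy joint model P_{X,Y}: Y ~ Bernoulli(1/2), X | Y = y ~ N(y, 1).
n = 100_000
Y = rng.integers(0, 2, size=n)          # true values of the quantity of interest
X = Y + rng.standard_normal(n)          # noisy observations

# Two simple decision rules f(X).
Y_hat_class = (X > 0.5).astype(int)     # threshold classifier, paired with the 0/1 loss
Y_hat_est = X                           # raw observation as an estimate, paired with squared error

# Empirical risks: sample averages of the losses approximate the expectations R(f).
R01 = np.mean(Y_hat_class != Y)         # approximates Pr(f(X) != Y)
R2 = np.mean((Y_hat_est - Y) ** 2)      # approximates E[ ||f(X) - Y||^2 ]

print(f"estimated 0/1 risk (probability of error): {R01:.3f}")
print(f"estimated squared-error risk (MSE):        {R2:.3f}")
```

For this particular toy model the threshold rule at $1/2$ happens to be the risk-minimizing classifier $f^*$ for the 0/1 loss, and the printed probability of error is close to $\Phi(-1/2) \approx 0.31$.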
0.2 The Maximum Likelihood Principle

The conditional distribution of the observation $X$ given the quantity of interest $Y$ is denoted by $P_{X|Y}(x|y)$. The conditional distribution $P_{X|Y}(x|y)$ can be viewed as a generative model, probabilistically describing the observations resulting from a given value, $y$, of the quantity of interest. For example, if $y$ is the value of a parameter, then $P_{X|Y}(x|y)$ is the probability distribution of the observation $X$ when the parameter value is set to $y$.

If $X$ is a continuous random variable with conditional density $p_{X|Y}(x|y)$, or a discrete random variable with conditional probability mass function (pmf) $p_{X|Y}(x|y)$, then given a value $y$ we can assess the probability of a particular measurement value $x$ by the magnitude of either the conditional density or pmf. In decision-making problems, we know the value of the observation, but do not know the value $y$. Therefore, it is appealing to consider the conditional density or pmf as a function of the unknown value $y$, with $X$ fixed at its observed value. The resulting function is called the likelihood function. As the name suggests, values of $y$ where the likelihood function is largest are intuitively reasonable indicators of the true value of the unknown quantity, which we will denote by $y^*$. The rationale for this is that these values would produce conditional densities or pmfs that place high probability on the observation $X = x$. The Maximum Likelihood Estimator (MLE) is defined to be the value of $y$ that maximizes the likelihood function; i.e., in the continuous case
$$\hat{y}(X) = \arg\max_y \; p_{X|Y}(X|y),$$
with an analogous definition for the discrete case obtained by replacing the conditional density with the conditional pmf. The decision rule $\hat{y}(X)$ is called an "estimator," a term that is common in decision problems involving a continuous parameter. Note that maximizing the likelihood function is equivalent to minimizing the negative log-likelihood function (since the logarithm is a monotonic transformation). Now let $y^*$ denote the true value of $Y$. Then we can view the negative log-likelihood as a loss function
$$\ell_L(y, y^*) = -\log p_{X|Y}(X|y),$$
where the dependence on $y^*$ on the right-hand side is embodied in the observation $X$. An interesting special case of the MLE results when the conditional density $p_{X|Y}(X|y)$ is Gaussian, in which case the negative log-likelihood corresponds to a squared error loss function. Now let us consider the expectation of this loss with respect to the conditional distribution $p_{X|Y}(x|y^*)$:
$$-E[\log p_{X|Y}(X|y)] = \int \log\!\left(\frac{1}{p_{X|Y}(x|y)}\right) p_{X|Y}(x|y^*)\,dx.$$
The true value $y^*$ minimizes the expected negative log-likelihood (or, equivalently, maximizes the expected log-likelihood). To see this, compare the expected log-likelihood of $y^*$ with that of any other value $y$:
$$E\!\left[\log p_{X|Y}(X|y^*) - \log p_{X|Y}(X|y)\right] = E\!\left[\log\frac{p_{X|Y}(X|y^*)}{p_{X|Y}(X|y)}\right] = \int \log\!\left(\frac{p_{X|Y}(x|y^*)}{p_{X|Y}(x|y)}\right) p_{X|Y}(x|y^*)\,dx = KL\!\left(p_{X|Y}(x|y^*),\, p_{X|Y}(x|y)\right). \quad (1)$$
The quantity $KL(p_{X|Y}(x|y^*), p_{X|Y}(x|y))$ is called the Kullback-Leibler (KL) divergence between the conditional density functions $p_{X|Y}(x|y^*)$ and $p_{X|Y}(x|y)$. The KL divergence is non-negative, and zero if and only if the two densities are equal [4]; hence the expected log-likelihood at $y^*$ is at least as large as at any other $y$. So we see that the KL divergence acts as a sort of risk function in the context of maximum likelihood estimation.
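To make the maximum likelihood principle concrete (again a hypothetical sketch, not part of the original text), consider the standard special case in which the observation is a vector $X = (X_1, \dots, X_n)$ of i.i.d. samples $X_i \sim N(y^*, \sigma^2)$ with $\sigma$ known. Minimizing the negative log-likelihood over $y$ recovers the sample mean, and the per-observation KL divergence $KL(N(y^*,\sigma^2), N(y,\sigma^2)) = (y-y^*)^2/(2\sigma^2)$ is zero exactly at $y = y^*$, consistent with Eq. (1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical generative model: n i.i.d. observations X_i ~ N(y_star, sigma^2), sigma known.
y_star, sigma, n = 2.0, 1.5, 200
X = y_star + sigma * rng.standard_normal(n)

def neg_log_likelihood(y):
    # Negative log-likelihood of the sample as a function of the mean y
    # (additive constants that do not depend on y are dropped).
    return np.sum((X - y) ** 2) / (2 * sigma**2)

# MLE by brute-force minimization of the negative log-likelihood over a grid of candidates.
grid = np.linspace(0.0, 4.0, 4001)
y_mle_grid = grid[np.argmin([neg_log_likelihood(y) for y in grid])]

# For this model the MLE has a closed form: the sample mean.
y_mle_closed = X.mean()

# Per-observation KL divergence KL(N(y_star, sigma^2), N(y, sigma^2)) = (y - y_star)^2 / (2 sigma^2):
# non-negative, and zero exactly at y = y_star.
kl = (grid - y_star) ** 2 / (2 * sigma**2)

print(f"MLE via grid search: {y_mle_grid:.3f}   MLE via sample mean: {y_mle_closed:.3f}")
print(f"KL divergence is minimized (= 0) at y = {grid[np.argmin(kl)]:.3f}")
```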
0.3 The Cramer-Rao Lower Bound

The MLE is based on finding the value of $Y$ that maximizes the likelihood function. Intuitively, if the maximum point is very distinct, say a well-isolated peak in the likelihood function, then it will be easier to distinguish the MLE from alternative decisions. Consider the case in which $Y$ is a scalar quantity. The "peakiness" of the log-likelihood function can be gauged by examining its curvature, $-\frac{\partial^2 \log p_{X|Y}(x|y)}{\partial y^2}$, at the point of maximum likelihood. The higher the curvature, the more peaked the behavior of the likelihood function at the maximum point. Of course, we hope that the MLE will be a good predictor (decision) for the unknown true value $y^*$.

So, rather than looking at the curvature of the log-likelihood function at the maximum likelihood point, a more appropriate measure of how easy it will be to distinguish $y^*$ from the alternatives is the expected curvature of the log-likelihood function evaluated at the value $y^*$, with the expectation taken over all possible observations with respect to the conditional density $p_{X|Y}(x|y^*)$. This quantity, denoted
$$I(y^*) = E\!\left[-\frac{\partial^2 \log p_{X|Y}(X|y)}{\partial y^2}\right]\bigg|_{y=y^*},$$
is called the Fisher Information (FI). In fact, the FI provides us with an important performance bound known as the Cramer-Rao Lower Bound (CRLB). The CRLB states that, under some mild regularity assumptions about the conditional density function $p_{X|Y}(x|y)$, the variance of any unbiased estimator is bounded from below by the inverse of $I(y^*)$ [1–3]. Recall that an unbiased estimator is any estimator $\hat{Y}$ that satisfies $E[\hat{Y}] = y^*$. The CRLB tells us that
$$\mathrm{var}(\hat{Y}) \;\geq\; \frac{1}{I(y^*)}.$$
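Continuing the Gaussian-mean sketch from above (an assumed example, not taken from the paper): with $X = (X_1, \dots, X_n)$ and $X_i \sim N(y^*, \sigma^2)$, the log-likelihood curvature does not depend on $y$, so $I(y^*) = n/\sigma^2$ and the CRLB is $\sigma^2/n$. The sample mean is unbiased with exactly this variance, so the bound is attained in this case; the sketch below checks this by simulation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Same hypothetical model: n i.i.d. observations X_i ~ N(y_star, sigma^2), sigma known.
y_star, sigma, n = 2.0, 1.5, 200

# The curvature of the log-likelihood, -d^2/dy^2 log p(X | y) = n / sigma^2, is constant in y,
# so the Fisher information is I(y_star) = n / sigma^2 and the CRLB is sigma^2 / n.
fisher_info = n / sigma**2
crlb = 1.0 / fisher_info

# The sample mean is an unbiased estimator of y_star; estimate its variance over repeated trials.
trials = 20_000
estimates = (y_star + sigma * rng.standard_normal((trials, n))).mean(axis=1)

print(f"CRLB (sigma^2 / n):             {crlb:.5f}")
print(f"empirical variance of the mean: {estimates.var():.5f}")  # approximately equal: the bound is attained here
```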
