Importance Sampling with Unequal Support
Abstract
Importance sampling is often used in machine learning when training and testing data come from different distributions. In this paper we propose a new variant of importance sampling that can reduce the variance of importance sampling-based estimates by orders of magnitude when the supports of the training and testing distributions differ. After motivating and presenting our new importance sampling estimator, we provide a detailed theoretical analysis that characterizes both its bias and variance relative to the ordinary importance sampling estimator (in various settings, which include cases where ordinary importance sampling is biased, while our new estimator is not, and vice versa). We conclude with an example of how our new importance sampling estimator can be used to improve estimates of how well a new treatment policy for diabetes will work for an individual, using only data from when the individual used a previous treatment policy.

Introduction

A key challenge in artificial intelligence is to estimate the expectation of a random variable. Instances of this problem arise in areas ranging from planning and decision making (e.g., estimating the expected sum of rewards produced by a policy for decision making under uncertainty) to probabilistic inference. Although the estimation of an expected value is straightforward if we can generate many independent and identically distributed (i.i.d.) samples from the relevant probability distribution (which we refer to as the target distribution), we may not have generative access to the target distribution. Instead, we might only have data from a different distribution that we call the sampling distribution. For example, in off-policy evaluation for reinforcement learning, the goal is to estimate the expected sum of rewards that a decision policy will produce, given only data gathered using some other policy.
Similarly, in supervised learning, we may wish to predict the performance of a regressor or classifier if it were to be applied to data that comes from a distribution that differs from the distribution of the available data (e.g., we might predict the accuracy of a classifier for hand-written letters given that observed letter frequencies come from English, using a corpus of labeled letters collected from German documents).

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

More precisely, we consider the problem of estimating θ := E[h(X)], where h is a real-valued function and the expectation is over the random variable X, which is a sample from the target distribution. As input we assume access to n i.i.d. samples from a sampling distribution that is different from the target distribution. A classical approach to this problem is to use importance sampling (IS), which reweights the observed samples to account for the difference between the target and sampling distributions (Kahn, 1955). Importance sampling produces an unbiased but often high-variance estimate of θ.

We introduce importance sampling with unequal support (US), a simple new importance sampling estimator that can drastically reduce the variance of importance sampling when the supports of the sampling and target distributions differ. This setting with unequal support can occur, for example, in our earlier example, where German documents might include symbols like ß that the classifier will not encounter. US essentially performs importance sampling only on the data that falls within the support of the target distribution, and then scales this estimate by a constant that reflects the relative support of the target and sampling distributions.
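The verbal description of US can be turned into a rough sketch. The exact US formula is not stated in this passage, so the code below is only one reading of the prose: it assumes the scaling constant `c` is the probability, under the sampling distribution, that a sample falls in the support of the target distribution, and it averages the importance-weighted values over the in-support samples only. The function name `us_sketch`, the predicate `in_support`, and the toy densities are all hypothetical.

```python
import random


def us_sketch(samples, f, g, h, in_support, c):
    """One reading of the verbal description of US (not a confirmed formula).

    in_support(x): hypothetical predicate, True when x lies in the support
        of the target distribution.
    c: assumed scaling constant, here the probability that a draw from the
        sampling distribution lands in that support.
    """
    inside = [x for x in samples if in_support(x)]
    if not inside:
        return None  # no sample in the target support: no estimate is formed
    # Importance-weighted average over the in-support samples only.
    weighted_mean = sum(f(x) / g(x) * h(x) for x in inside) / len(inside)
    # Rescale by the assumed relative-support constant.
    return c * weighted_mean


# Toy setup (an assumption for illustration): target uniform on [0, 1],
# sampling distribution uniform on [0, 2], evaluation function h(x) = x.
def f(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def g(x):
    return 0.5 if 0.0 <= x <= 2.0 else 0.0


def h(x):
    return x


def in_support(x):
    return f(x) != 0.0


rng = random.Random(0)
samples = [rng.uniform(0.0, 2.0) for _ in range(10_000)]
estimate = us_sketch(samples, f, g, h, in_support, c=0.5)
print(estimate)  # the true value for this toy setup is 0.5
```

In this toy setup half of the sampling mass lies in the target support, which is why `c=0.5` is passed; in practice that constant would have to be derived from the known densities.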
US typically has lower variance than ordinary importance sampling (sometimes by orders of magnitude), and is unbiased in the important setting where at least one sample falls within the support of the target distribution. If no samples do, then none of the available data could have been generated by the target distribution, and so it is unclear what would make for a reasonable estimate. Furthermore, the conditionally unbiased nature of US is sufficient to allow for its use with concentration inequalities like Hoeffding's inequality to construct confidence bounds on θ. By contrast, weighted importance sampling (Rubinstein, 1981) is another variant of importance sampling that can reduce variance, but which introduces bias that makes it incompatible with Hoeffding's inequality.

Problem Setting and Importance Sampling

Let f and g be probability density functions (PDFs) for two distributions that we call the target distribution and sampling distribution, respectively. Let h : ℝ → ℝ be called the evaluation function. Let θ := E_f[h(X)], where E_f denotes the expected value given that f is the PDF of the random variable(s) in the expectation (in this case, just X). Let F := {x ∈ ℝ : f(x) ≠ 0}, G := {x ∈ ℝ : g(x) ≠ 0}, and H := {x ∈ ℝ : h(x) ≠ 0} be the supports of the target and sampling distributions, and the evaluation function, respectively. In this paper we will discuss techniques for estimating θ given n ∈ ℕ_{>0} i.i.d. samples, X_n := {X_1, . . . , X_n}, from the sampling distribution, and we focus on the setting where F ∩ H ⊂ G, that is, where the joint support F ∩ H is a strict subset of G. The importance sampling estimator,

$$\mathrm{IS}(X_n) := t + \frac{1}{n} \sum_{i=1}^{n} \frac{f(X_i)}{g(X_i)} \bigl(h(X_i) - t\bigr), \tag{1}$$

is a widely used estimator of θ, where t = 0 (we consider non-zero values of t later). If F ∩ H ⊆ G, then IS(X_n) is a consistent and unbiased estimator of θ. That is, IS(X_n) → θ almost surely and E_g[IS(X_n)] = θ (we review this latter result in Property 1 in the supplemental document).
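As a minimal sketch of equation (1), the function below computes the importance sampling estimate for arbitrary density and evaluation functions. The name `importance_sampling` and the toy densities are assumptions made for illustration: the sampling distribution is uniform on [0, 2], the target has density x/2 on [0, 2], and h(x) = x, so that θ = 4/3.

```python
import random


def importance_sampling(samples, f, g, h, t=0.0):
    """Equation (1): t + (1/n) * sum_i f(X_i)/g(X_i) * (h(X_i) - t).

    samples are assumed to be drawn i.i.d. from the density g, so g(x) > 0
    for every sample; t is the constant offset (t = 0 gives plain IS).
    """
    n = len(samples)
    return t + sum(f(x) / g(x) * (h(x) - t) for x in samples) / n


# Toy densities (assumptions for illustration, not taken from the text).
def f(x):
    return x / 2.0 if 0.0 <= x <= 2.0 else 0.0  # target density on [0, 2]


def g(x):
    return 0.5 if 0.0 <= x <= 2.0 else 0.0  # uniform sampling density


def h(x):
    return x


rng = random.Random(0)
samples = [rng.uniform(0.0, 2.0) for _ in range(20_000)]

theta = 4.0 / 3.0  # integral of x * (x / 2) over [0, 2]
estimate = importance_sampling(samples, f, g, h)
estimate_offset = importance_sampling(samples, f, g, h, t=theta)
print(estimate, estimate_offset)  # both should be near 4/3
```

Passing `t=theta` here is only a way to exercise the offset term; in a real problem θ is the unknown quantity, so t would have to come from prior knowledge or a separate estimate.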
A control variate is a constant, t ∈ ℝ, that is subtracted from each h(X_i) and then added back to the final estimate, as in (1) (Hammersley, 1960; Hammersley and Handscomb, 1964). Although control variates, t(X_i), that depend on the sample, X_i, can be beneficial, for our later purposes we only consider constant control variates. Intuitively, including a constant control variate equates to estimating θ′ := E_f[h′(X)] using importance sampling without a control variate, where h′(x) = h(x) − t, and then adding t to the resulting estimate to get an estimate of θ. Later we show that the variance of importance sampling increases with θ, and so applying importance sampling to h results in higher variance than applying importance sampling to h′ with t ≈ θ, since then θ′ ≈ 0. That is, by inducing a kind of normalization, a control variate can reduce the variance of estimates without introducing bias, a property that has made the inclusion of control variates a popular topic in some recent works using importance sampling (Dudík et al., 2011; Jiang and Li, 2016; Thomas and Brunskill, 2016). Although later we discuss control variates more, for simplicity our derivations focus on importance sampling estimators without control variates.

There are also other extensions of the importance sampling estimator that can reduce variance, notably the weighted importance sampling estimator, which we compare to later, and which can provide large reductions of variance and mean squared error, but which introduces bias.

An Illustrative Example

In this section we present an example that highlights the peculiar behavior of the IS estimator when F ∩ H ≠ G. The illustrative example, depicted in Figure 1, is defined as follows. Let g(x) = 0.5 if x ∈ [0, 2] and g(x) = 0 otherwise, and let f(x) = 1 if x ∈ [0, 1] and f(x) = 0 otherwise. So, F = [0, 1] and G = [0, 2]. Let h(x) = 1 if x ∈ [0, 1] and h(x) = 0 otherwise, so that H = [0, 1]. Notice that θ = 1.
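The example can be simulated directly. The sketch below, with hypothetical names and an arbitrarily chosen sample size n = 10, repeats the IS estimate of equation (1) many times on fresh samples, once with t = 0 and once with the constant control variate t = 1. With t = 0 each term is 2 when the sample lands in [0, 1] and 0 otherwise, so the estimate is 2K/n with K binomial(n, 0.5), which has mean 1 and variance 1/n. With t = 1 every term vanishes in this particular example, so the estimate is exactly 1.

```python
import random
import statistics


def f(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0  # target: uniform on [0, 1]


def g(x):
    return 0.5 if 0.0 <= x <= 2.0 else 0.0  # sampling: uniform on [0, 2]


def h(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0  # evaluation function


def importance_sampling(samples, t=0.0):
    """Equation (1) for the densities f, g and the function h defined above."""
    n = len(samples)
    return t + sum(f(x) / g(x) * (h(x) - t) for x in samples) / n


rng = random.Random(0)
n, trials = 10, 2_000  # sample size and repetition count are arbitrary choices

plain, with_cv = [], []
for _ in range(trials):
    samples = [rng.uniform(0.0, 2.0) for _ in range(n)]
    plain.append(importance_sampling(samples, t=0.0))
    with_cv.append(importance_sampling(samples, t=1.0))

mean_plain = statistics.mean(plain)
var_plain = statistics.pvariance(plain)
var_cv = statistics.pvariance(with_cv)
print(mean_plain, var_plain, var_cv)  # theory: mean 1, variance 1/n, variance 0
```

The zero variance with t = 1 is specific to this example, where h equals θ everywhere on F; it is shown only to make the effect of choosing t ≈ θ concrete.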
Since the sampling and target distributions are both uniform, an obvious estimator of θ (if f and g are known but h is not) would be the average of the points that fall within F. Let (#X_i ∈ F) denote the number of samples in X_n that
Publication date: 2017