Nonparametric Assessment of Contamination in Multivariate Data Using Generalized Quantile Sets and FDR
Authors
Abstract
Large, multivariate datasets from high-throughput instrumentation have become ubiquitous in the sciences. Frequently, it is of interest to characterize the measurements in these datasets by the extent to which they represent ‘nominal’ versus ‘contaminated’ instances. However, the nature of even the nominal patterns in the data is often unknown and potentially quite complex, making their explicit parametric modeling a daunting task. In this paper, we introduce a nonparametric method for the simultaneous annotation of multivariate data (called MN-SCAnn), which produces an annotated ranking of the observations indicating the relative extent to which each may or may not be considered nominal, while making minimal assumptions about the nominal distribution. In our framework, each observation is linked to a corresponding generalized quantile set and, implicitly adopting a hypothesis testing perspective, each set is associated with a test, which in turn is accompanied by a certain false discovery rate. The combination of generalized quantile set methods with false discovery rate principles, in the context of contaminated data, is new, and estimating the key underlying quantities requires that a number of issues be addressed. We illustrate MN-SCAnn through examples in two contexts: the pre-processing of cell-based assays in bioinformatics, and the detection of anomalous traffic patterns in Internet measurement studies.
∗Department of Electrical Engineering and Computer Science, University of Michigan, 1301 Beal Avenue, Ann Arbor, MI 48105. Email: cscott-at-eecs-dot-umich-dot-edu
†Department of Mathematics and Statistics, Boston University, 111 Cummington Street, Boston, MA 02215. Email: kolaczyk-at-math-dot-bu-dot-edu
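The chain described in the abstract (observation, quantile set, test, false discovery rate) can be illustrated with a rough sketch. This is not the MN-SCAnn algorithm, whose details the abstract does not give. It assumes, purely for simplicity, that a separate sample of nominal observations is available, uses a k-nearest-neighbor distance as a stand-in for membership in nested quantile sets, and applies the Benjamini-Hochberg adjustment as one common FDR procedure. All function names and the parameter `k` are hypothetical.

```python
import numpy as np

def knn_score(points, reference, k=5):
    """Distance from each point to its k-th nearest reference point.
    A larger score means the point lies in a sparser, less nominal region."""
    diffs = points[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return np.sort(dists, axis=1)[:, k - 1]

def empirical_p_values(test_scores, null_scores):
    """Share of nominal calibration scores at least as large as each test score
    (with the usual +1 correction so that no p-value is exactly zero)."""
    null_sorted = np.sort(null_scores)
    n_ge = len(null_sorted) - np.searchsorted(null_sorted, test_scores, side="left")
    return (n_ge + 1) / (len(null_sorted) + 1)

def bh_q_values(p):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)
    return q

def annotate(data, nominal, k=5):
    """Rank observations from least to most nominal and attach a q-value to each.
    Assumption: `nominal` is a clean nominal sample, split into a reference half
    (used for scoring) and a calibration half (used for the null distribution)."""
    half = len(nominal) // 2
    reference, calibration = nominal[:half], nominal[half:]
    null_scores = knn_score(calibration, reference, k)
    scores = knn_score(data, reference, k)
    q = bh_q_values(empirical_p_values(scores, null_scores))
    ranking = np.argsort(-scores, kind="stable")  # least nominal first
    return ranking, q
```

Calling `annotate(data, nominal)` returns the indices of `data` ordered from least to most nominal, together with one q-value per observation. Observations whose q-value falls below a chosen level could then be flagged as contaminated.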
Similar resources
Model-based approaches to nonparametric Bayesian quantile regression
In several regression applications, a different structural relationship might be anticipated for the higher or lower responses than the average responses. In such cases, quantile regression analysis can uncover important features that would likely be overlooked by mean regression. We develop two distinct Bayesian approaches to fully nonparametric model-based quantile regression. The first appro...
Nonparametric Assessment of Contamination in Multivariate Data Using Minimum Volume Sets and FDR
Large, multivariate datasets from high-throughput instrumentation have become ubiquitous throughout the sciences. Frequently, it is of great interest to characterize the measurements in these datasets by the extent to which they represent ‘nominal’ versus ‘contaminated’ instances. However, the nature of even the nominal patterns in the data is often unknown and potentially quite complex, makin...
Bayesian Nonparametric Modeling in Quantile Regression
We propose Bayesian nonparametric methodology for quantile regression modeling. In particular, we develop Dirichlet process mixture models for the error distribution in an additive quantile regression formulation. The proposed nonparametric prior probability models allow the data to drive the shape of the error density and thus provide more reliable predictive inference than models based on par...
Nonparametric multivariate conditional distribution and quantile regression
In nonparametric multivariate regression analysis, one usually seeks methods to reduce the dimensionality of the regression function to bypass the difficulty caused by the curse of dimensionality. We study nonparametric estimation of multivariate conditional distribution and quantile regression via local univariate quadratic estimation of partial derivatives of bivariate copulas. Without restri...
A Frisch-Newton Algorithm for Sparse Quantile Regression
Recent experience has shown that interior-point methods using a log barrier approach are far superior to classical simplex methods for computing solutions to large parametric quantile regression problems. In many large empirical applications, the design matrix has a very sparse structure. A typical example is the classical fixed-effect model for panel data where the parametric dimension of the ...
Publication date: 2007