Sampling Bias and Class Imbalance in Maximum-likelihood Logistic Regression

Authors

  • Thomas Oommen
  • Laurie G. Baise
  • Richard M. Vogel
Abstract

Logistic regression is a widely used statistical method for relating a binary response variable to a set of explanatory variables, and maximum likelihood is the most commonly used method for parameter estimation. A maximum-likelihood logistic regression (MLLR) model predicts the probability of an event from binary data defining the event. MLLR models are currently used in many fields, including geosciences, natural hazard evaluation, medical diagnosis, homeland security, and finance. In such applications, the empirical sample data often exhibit class imbalance, where one class is represented by a large number of events while the other is represented by only a few. The data can also exhibit sampling bias, which occurs when the class distribution in the sample differs from the actual class distribution in the population. Previous studies have evaluated how class imbalance and sampling bias affect the predictive capability of asymptotic classification algorithms such as MLLR, yet no definitive conclusions have been reached. We hypothesize that the predictive capability of the model is related to the sampling bias associated with the data, so that the MLLR model has perfect predictability when the data have no sampling bias. We test our hypotheses using two simulated datasets with class distributions of 50:50 and 80:20, respectively. We construct a suite of controlled experiments by extracting multiple samples with varying class imbalance and sampling bias from the two simulated datasets and fitting MLLR models to each of these samples. The experiments suggest that it is important to develop a sample that has the same class distribution as the original population rather than ensuring that the classes are balanced. Furthermore, when sampling bias is reduced by either over-sampling or under-sampling, both sampling techniques can improve the predictive capability of an MLLR model.

T. Oommen · L.G. Baise · R.M. Vogel: Department of Civil and Environmental Engineering, Tufts University, 113 Anderson Hall, Medford, MA 02155, USA. Present address (T. Oommen): Dept. of Geological Engineering, Michigan Tech., Houghton, MI 49931, USA.

Math Geosci (2011) 43: 99–120
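The kind of controlled experiment described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' setup: the population size, the coefficients in `true_beta`, the sample size, and the helper names `fit_logistic_mle` and `draw_sample` are all assumptions chosen for illustration, and the mean predicted probability over the population is used here only as a simple proxy for predictive capability. It simulates a population with roughly 20% events, draws one sample that matches the population class distribution and one that is balanced 50:50, and fits a maximum-likelihood logistic regression to each.

```python
import numpy as np

def fit_logistic_mle(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    Xb = np.column_stack([np.ones(len(X)), X])      # prepend intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ beta)))      # current fitted probabilities
        grad = Xb.T @ (y - p)                       # score vector
        hess = Xb.T @ (Xb * (p * (1.0 - p))[:, None])  # observed information
        beta = beta + np.linalg.solve(hess, grad)   # Newton step
    return beta

def draw_sample(X, y, n, event_fraction, rng):
    """Draw n rows so that a chosen fraction of them are events (y == 1)."""
    n_pos = int(round(n * event_fraction))
    pos = rng.choice(np.flatnonzero(y == 1), n_pos, replace=False)
    neg = rng.choice(np.flatnonzero(y == 0), n - n_pos, replace=False)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]

def predict(beta, X):
    """Predicted event probabilities for a fitted coefficient vector."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-(Xb @ beta)))

rng = np.random.default_rng(0)

# Hypothetical population: one explanatory variable, about 20% events.
N = 100_000
X_pop = rng.normal(size=(N, 1))
true_beta = np.array([-1.9, 1.5])                   # assumed, not from the paper
y_pop = rng.binomial(1, predict(true_beta, X_pop))
prevalence = y_pop.mean()

# Sample with the population class distribution (no sampling bias).
X_m, y_m = draw_sample(X_pop, y_pop, 5000, prevalence, rng)
beta_matched = fit_logistic_mle(X_m, y_m)

# Balanced 50:50 sample (sampling bias relative to the population).
X_b, y_b = draw_sample(X_pop, y_pop, 5000, 0.5, rng)
beta_balanced = fit_logistic_mle(X_b, y_b)

# Compare average predicted probability over the population with the truth.
mean_pred_matched = predict(beta_matched, X_pop).mean()
mean_pred_balanced = predict(beta_balanced, X_pop).mean()

print(f"population event rate:        {prevalence:.3f}")
print(f"mean prediction, matched fit:  {mean_pred_matched:.3f}")
print(f"mean prediction, balanced fit: {mean_pred_balanced:.3f}")
```

In this sketch the model fitted to the balanced sample overstates the event probability across the population, while the model fitted to the matched sample stays close to the population event rate. That is consistent with the abstract's suggestion that matching the population class distribution matters more than balancing the classes, though a single run on one simulated population is only an illustration.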

Related articles

Comparison of Maximum Likelihood Estimation and Bayesian with Generalized Gibbs Sampling for Ordinal Regression Analysis of Ovarian Hyperstimulation Syndrome

Background and Objectives: Analysis of ordinal outcome data can lead to biased estimates and large variance when the data are sparse. The objective of this study is to compare parameter estimates of an ordinal regression model under maximum likelihood and a Bayesian framework with generalized Gibbs sampling. The models were used to analyze ovarian hyperstimulation syndrome data. Methods: This study use...

Bayesian and Iterative Maximum Likelihood Estimation of the Coefficients in Logistic Regression Analysis with Linked Data

This paper considers logistic regression analysis with linked data. It is shown that, in logistic regression analysis with linked data, a finite mixture of Bernoulli distributions can be used for modeling the response variables. We propose an iterative maximum likelihood estimator for the regression coefficients that takes the matching probabilities into account. Next, the Bayesian counterpart...

Spatial Regression in the Presence of Misaligned data

In this paper, four approaches are presented to the problem of fitting a linear regression model in the presence of spatially misaligned data. These approaches are the plug-in method, simulation, regression calibration, and maximum likelihood. In the first two approaches, by modeling the correlation between the explanatory variable, prediction of the explanatory variable is determined at sites...

Estimation of Parameters for an Extended Generalized Half Logistic Distribution Based on Complete and Censored Data

This paper considers an Extended Generalized Half Logistic distribution. We derive some properties of this distribution and then discuss estimation of the distribution parameters by the method of moments, maximum likelihood, and the new minimum spacing distance estimator based on complete data. Also, maximum likelihood equations for estimating the parameters based on Type-I and Typ...

Asymptotic properties of a double penalized maximum likelihood estimator in logistic regression

Maximum likelihood estimates in logistic regression may encounter serious bias or even non-existence with many covariates or with highly correlated covariates. In this paper, we show that a double penalized maximum likelihood estimator is asymptotically consistent in large samples. © 2007 Elsevier B.V. All rights reserved.

Publication date: 2010