imbalanced data sets

An Imbalanced Data Rule Learner

2005

Canh Hao Nguyen, Tu Bao Ho

Imbalanced data learning has recently begun to receive much attention from research and industrial communities as traditional machine learners no longer give satisfactory results. Solutions to the problem generally attempt to adapt standard learners to the imbalanced data setting. Basically, higher weights are assigned to small class examples to avoid their being overshadowed by the large class...
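The weighting idea mentioned in this abstract can be illustrated with a minimal sketch. It uses a common inverse-frequency heuristic and is not the rule learner described in the paper, whose details are cut off in the abstract above; the function name `inverse_frequency_weights` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class: n_samples / (n_classes * class_count).

    Hypothetical helper: rarer classes receive proportionally larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy data (assumed): 90 examples of the large class, 10 of the small class.
labels = ["large"] * 90 + ["small"] * 10
weights = inverse_frequency_weights(labels)
```

With these toy labels the small class is weighted nine times as heavily as the large class, so that each class contributes the same total weight to a weighted loss.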


A Priori Synthetic Sampling for Increasing Classification Sensitivity in Imbalanced Data Sets

2017

William Rivera

Class imbalance data usually suffers from data intrinsic properties beyond that of imbalance alone. The problem is intensified with larger levels of imbalance most commonly found in observational studies. Extreme cases of class imbalance are commonly found in many domains including fraud detection, mammography of cancer and post term births. These rare events are usually the most costly or have...


Classification of Imbalanced Data with Random Sets and Mean-Variance Filtering

Journal: IJDWM 2008

Vladimir Nikulin

Imbalanced data represent a significant problem because the corresponding classifier has a tendency to ignore patterns which have smaller representation in the training set. We propose to consider a large number of balanced training subsets where representatives from the larger pattern are selected randomly. As an outcome, the system will produce a matrix of linear regression coefficients where...
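The subset-construction step described in this abstract might look roughly like the sketch below. It covers only the random drawing of balanced subsets; the regression fitting and mean-variance filtering are not shown because the abstract is truncated. The function name `balanced_subsets`, the `seed` parameter, and the toy data are assumptions, and the sketch assumes the larger class has at least as many examples as the smaller one.

```python
import random


def balanced_subsets(minority, majority, n_subsets, seed=0):
    """Build n_subsets balanced training sets (hypothetical helper).

    Each subset keeps every minority example and adds an equally sized
    random sample of majority examples, drawn without replacement.
    """
    rng = random.Random(seed)
    k = len(minority)
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, k)
        subsets.append(list(minority) + sampled)
    return subsets


# Toy data (assumed): 5 minority and 50 majority examples.
minority = [("m", i) for i in range(5)]
majority = [("M", i) for i in range(50)]
subsets = balanced_subsets(minority, majority, n_subsets=3)
```

A separate model could then be fitted on each subset and the resulting coefficients collected row by row, which is one plausible reading of the coefficient matrix the abstract mentions.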


B-ROC Curves for the Assessment of Classifiers over Imbalanced Data Sets

2006

Alvaro A. Cárdenas, John S. Baras

The class imbalance problem appears to be ubiquitous to a large portion of the machine learning and data mining communities. One of the key questions in this setting is how to evaluate the learning algorithms in the case of class imbalances. In this paper we introduce the Bayesian Receiver Operating Characteristic (B-ROC) curves, as a set of tradeoff curves that combine in an intuitive way, the...


A Wrapper for Reweighting Training Instances for Handling Imbalanced Data Sets

2007

M. Karagiannopoulos, D. Anyfantis, Sotiris B. Kotsiantis, Panayiotis E. Pintelas

A classifier induced from an imbalanced data set has a low error rate for the majority class and an undesirable error rate for the minority class. This paper firstly provides a systematic study on the various methodologies that have tried to handle this problem. Finally, it presents an experimental study of these methodologies with a proposed wrapper for reweighting training instances and it co...


An Empirical Study of the Behavior of Classifiers on Imbalanced and Overlapped Data Sets

2007

Vicente García, José Salvador Sánchez, Ramón Alberto Mollineda

Class imbalance has been reported as an important obstacle to apply traditional learning algorithms to real-world domains. Recent investigations have questioned whether the imbalance is the unique factor that hinders the performance of classifiers. In this paper, we study the behavior of six algorithms when classifying imbalanced, overlapped data sets under uncommon situations (e.g., when the o...


An Approach to Imbalanced Data Sets Based on Changing Rule Strength

2004

Jerzy W. Grzymala-Busse, Linda K. Goodwin, Witold J. Grzymala-Busse, Xinqun Zheng

This paper describes experiments with a challenging data set describing preterm births. The data set, collected at the Duke University Medical Center, was large and, at the same time, many attribute values were missing. However, the main problem was that only 20.7% of the total number of cases represented the important preterm birth class. Thus the data set was imbalanced. For comparison, we in...


Multi-class Imbalanced Data-Sets with Linguistic Fuzzy Rule Based Classification Systems Based on Pairwise Learning

2010

Alberto Fernández, María José del Jesús, Francisco Herrera

In a classification task, the imbalance class problem is present when the data-set has a very different distribution of examples among their classes. The main handicap of this type of problem is that standard learning algorithms consider a balanced training set and this supposes a bias towards the majority classes. In order to provide a correct identification of the different classes of the pro...


GP Classification under Imbalanced Data sets: Active Sub-sampling and AUC Approximation

2008

John A. Doucette, Malcolm I. Heywood

The problem of evolving binary classification models under increasingly unbalanced data sets is approached by proposing a strategy consisting of two components: Sub-sampling and ‘robust’ fitness function design. In particular, recent work in the wider machine learning literature has recognized that maintaining the original distribution of exemplars during training is often not appropriate for d...


Peculiar Genes Selection: A new features selection method to improve classification performances in imbalanced data sets

2017

Federica Martina, Marco Beccuti, Gianfranco Balbo, Francesca Cordero

High-throughput technologies provide genomic and transcriptomic data that are suitable for biomarker detection for classification purposes. However, the high dimension of the output of such technologies and the characteristics of the data sets analysed represent an issue for the classification task. Here we present a new feature selection method based on three steps to detect class-specific biom...
