imbalanced data sets

A Survey of Predictive Modelling under Imbalanced Distributions

Journal: CoRR 2015

Paula Branco, Luís Torgo, Rita P. Ribeiro

Many real-world data mining applications involve obtaining predictive models using data sets with strongly imbalanced distributions of the target variable. Frequently, the least common values of this target variable are associated with events that are highly relevant for end users (e.g. fraud detection, unusual returns on stock markets, or anticipation of catastrophes). Moreover, the events ...

A novel ensemble method for classifying imbalanced data

Journal: Pattern Recognition 2015

Zhongbin Sun, Qinbao Song, Xiaoyan Zhu, Heli Sun, Baowen Xu, Yuming Zhou

Class imbalance problems have been reported to severely hinder the classification performance of many standard learning algorithms and have attracted a great deal of attention from researchers in different fields. Consequently, a number of methods, such as sampling methods, cost-sensitive learning methods, and bagging- and boosting-based ensemble methods, have been proposed to solve these problems...

Oversampling to Overcome Overfitting: Exploring the Relationship between Data Set Composition, Molecular Descriptors, and Predictive Modeling Methods

Journal: Journal of Chemical Information and Modeling 2013

Chia-Yun Chang, Ming-Tsung Hsu, Emilio Xavier Esposito, Yufeng J. Tseng

The traditional biological assay is very time-consuming, and thus the ability to quickly screen large numbers of compounds against a specific biological target is appealing. To speed up the biological evaluation of compounds, high-throughput screening is widely used in the fields of biomedical, biological information, and drug discovery. The research presented in this study focuses on the use o...

Classifying Severely Imbalanced Data

2011

William Klement, Szymon Wilk, Wojtek Michalowski, Stan Matwin

Learning from data with severe class imbalance is difficult. Established solutions include: under-sampling, adjusting classification threshold, and using an ensemble. We examine the performance of combining these solutions to balance the sensitivity and specificity for binary classifications, and to reduce the MSE score for probability estimation.
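
Two of the solutions named in this abstract, under-sampling and adjusting the classification threshold, can be illustrated with a minimal sketch. It is not the authors' method: the helper names `random_undersample` and `classify`, the toy data, and the threshold values are all assumptions chosen for illustration, and the ensemble step is left out.

```python
import random
from collections import defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop examples until every class is as small as the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    # Size of the rarest class; every class is cut down to this size.
    target = min(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            balanced.append((row, label))
    rng.shuffle(balanced)
    return [r for r, _ in balanced], [l for _, l in balanced]


def classify(score, threshold=0.5):
    """Label an example positive (1) when its predicted score reaches the threshold."""
    return 1 if score >= threshold else 0


# Toy data (hypothetical): 95 negatives and 5 positives.
rows = list(range(100))
labels = [0] * 95 + [1] * 5
balanced_rows, balanced_labels = random_undersample(rows, labels)
```

Lowering `threshold` below 0.5 makes `classify` label more examples as the rare positive class, which trades specificity for sensitivity; under-sampling instead changes the training data itself by discarding majority examples.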

Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data

Journal: Knowl.-Based Syst. 2016

Yijing Li, Haixiang Guo, Xiao Liu, Yanan Li, Jinling Li

Learning from imbalanced data, where one class has significantly fewer observations than the others, has gained considerable attention in the data mining community. Most existing literature focuses on the binary imbalanced case, while multi-class imbalanced learning is barely mentioned. Moreover, most proposed algorithms treated all imbalanced data consistently and aimed to ...

An information granulation based data mining approach for classifying imbalanced data

Journal: Inf. Sci. 2008

Mu-Chen Chen, Long-Sheng Chen, Chun-Chin Hsu, Wei-Rong Zeng

Recently, the class imbalance problem has attracted much attention from researchers in the field of data mining. When learning from imbalanced data in which most examples are labeled as one class and only a few belong to another class, traditional data mining approaches predict the crucial minority instances poorly. Unfortunately, many real-world data sets like health exami...

Classification of imbalanced bioinformatics data by using boundary movement-based ELM.

Journal: Bio-Medical Materials and Engineering 2015

Ke Cheng, Qingfang Chen, Xibei Yang, Shang Gao, Hualong Yu

To address the imbalanced classification problem emerging in bioinformatics, a boundary movement-based extreme learning machine (ELM) algorithm called BM-ELM was proposed. BM-ELM first explores the prior information about the data distribution by condensing all training instances into the one-dimensional feature space corresponding to the original output in ELM, and then on the transforme...

Multi-class protein fold classification using a new ensemble machine learning approach.

Journal: Genome Informatics. International Conference on Genome Informatics 2003

Aik Choon Tan, David Gilbert, Yves Deville

Protein structure classification represents an important process in understanding the associations between sequence and structure as well as possible functional and evolutionary relationships. Recent structural genomics initiatives and other high-throughput experiments have populated the biological databases at a rapid pace. The amount of structural data has made traditional methods such as man...

Feature Selection for Highly Skewed Sentiment Analysis Tasks

2014

Can Liu, Sandra Kübler, Ning Yu

Sentiment analysis generally uses large feature sets based on a bag-of-words approach, which results in a situation where individual features are not very informative. In addition, many data sets tend to be heavily skewed. We approach this combination of challenges by investigating feature selection in order to reduce the large number of features to those that are discriminative. We examine the...

Empirical Study of Bagging Predictors on Medical Data

2011

Guohua Liang, Chengqi Zhang

This study investigates the performance of bagging in terms of learning from imbalanced medical data. It is important for data miners to achieve highly accurate prediction models, and this is especially true for imbalanced medical applications. In these situations, practitioners are more interested in the minority class than the majority class; however, it is hard for a traditional supervised l...
