imbalanced data

Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics

2015

Zejin Ding Yanqing Zhang

In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalanced data learning is both important and challenging in many real applications. Dealing with a minority class normally needs new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in t...

First study of the behaviour of genetic fuzzy classifier based on low quality data respect to the preprocessing of low quality imbalanced datasets

2010

Ana M. Palacios Luciano Sánchez Inés Couso

There are real-world datasets in which the classes contain very different percentages of patterns: some classes are represented by many examples (a high percentage of patterns) and others by few examples (a low percentage of patterns). Datasets of this kind are called “imbalanced datasets”. In the field of classification problems the imbalanced d...

A study of data pre-processing techniques for imbalanced biomedical data classification

Journal: International Journal of Bioinformatics Research and Applications, 2020

PDFOS: PDF estimation based over-sampling for imbalanced two-class problems

Journal: Neurocomputing, 2014

Ming Gao Xia Hong Sheng Chen Christopher J. Harris Emad Khalaf

This contribution proposes a novel probability density function (PDF) estimation based over-sampling (PDFOS) approach for two-class imbalanced classification problems. The classical Parzen-window kernel function is adopted to estimate the PDF of the positive class. Then according to the estimated PDF, synthetic instances are generated as the additional training data. The essential concept is to...
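The two steps named in this abstract (estimate the positive-class PDF with a Parzen window, then draw synthetic instances from it) can be illustrated with a minimal sketch. It assumes a Gaussian kernel with a single fixed bandwidth shared by all features; the function name `kde_oversample`, the `bandwidth` value, and the toy data are hypothetical, and the truncated abstract does not say how the paper chooses its kernel width, so this is not the authors' exact procedure. It relies on the standard fact that sampling from a Gaussian kernel density estimate is the same as picking a stored sample at random and adding Gaussian noise.

```python
import random


def kde_oversample(minority, n_new, bandwidth, seed=0):
    """Draw n_new synthetic points from a Gaussian Parzen-window estimate.

    Each draw picks one stored minority sample uniformly at random
    (choosing a kernel) and adds zero-mean Gaussian noise with standard
    deviation `bandwidth` to every feature (sampling from that kernel).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        centre = rng.choice(minority)
        synthetic.append([x + rng.gauss(0.0, bandwidth) for x in centre])
    return synthetic


# Toy positive (minority) class with two features; values are illustrative only.
minority = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.2]]
new_points = kde_oversample(minority, n_new=5, bandwidth=0.1)
```

The synthetic points would then be appended to the training set as additional positive-class examples.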

Training Neural Networks for Protein Secondary Structure Prediction: The Effects of Imbalanced Data Set

2009

Viviane Palodeto Hernán Terenzi Jefferson Luiz Brum Marques

Protein secondary structure prediction (PSSP) is one of the main tasks in computational biology. During the last few decades, much effort has been made towards solving this problem with various approaches, mainly artificial neural networks (ANN). Generally, in order to predict the protein secondary structure, the ANN training process is performed using the CB513 data set. Like protein structures d...

Adaptive Oversampling for Imbalanced Data Classification

2013

Seyda Ertekin

Data imbalance is known to significantly hinder the generalization performance of supervised learning algorithms. A common strategy to overcome this challenge is synthetic oversampling, where synthetic minority class examples are generated to balance the distribution between the examples of the majority and minority classes. We present a novel adaptive oversampling algorithm, VIRTUAL, that comb...

Neighbourhood sampling in bagging for imbalanced data

Journal: Neurocomputing, 2015

Jerzy Blaszczynski Jerzy Stefanowski

Various approaches to extending bagging ensembles for class-imbalanced data are considered. First, we review known extensions and compare them in a comprehensive experimental study. The results show that integrating bagging with under-sampling is more powerful than over-sampling. They also single out Roughly Balanced Bagging as the most accurate extension. Then, we point out that complex...
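One simple way to integrate bagging with under-sampling can be sketched as follows. This is an illustration under stated assumptions, not the paper's method or its Roughly Balanced Bagging variant: each bag takes a bootstrap sample of the minority class and an equally sized random sample of the majority class, so every base learner sees balanced classes. The function name `undersampled_bags` and the toy data are hypothetical.

```python
import random


def undersampled_bags(majority, minority, n_bags, seed=0):
    """Build class-balanced bootstrap bags by under-sampling the majority.

    Every bag holds len(minority) examples drawn with replacement from
    each class, so the majority class is cut down to the minority size.
    """
    rng = random.Random(seed)
    size = len(minority)
    bags = []
    for _ in range(n_bags):
        minority_part = [rng.choice(minority) for _ in range(size)]
        majority_part = [rng.choice(majority) for _ in range(size)]
        bags.append((majority_part, minority_part))
    return bags


# Toy data: ten majority examples, three minority examples (labels omitted).
majority = [[float(i), 0.0] for i in range(10)]
minority = [[0.5, 1.0], [1.5, 1.0], [2.5, 1.0]]
bags = undersampled_bags(majority, minority, n_bags=4)
```

A base classifier would then be trained on each bag and the predictions combined by voting, as in ordinary bagging.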

Actively Balanced Bagging for Imbalanced Data

2017

Jerzy Blaszczynski Jerzy Stefanowski

Under-sampling extensions of bagging are currently the most accurate ensembles specialized for class-imbalanced data. Nevertheless, since improved recognition of the minority class in this type of ensemble is usually associated with decreased recognition of the majority classes, we introduce a new two-phase ensemble called Actively Balanced Bagging. The proposal is to first learn a...

Knowledge Assisted Visualization for Imbalanced Data Clustering

2013

P. Alagambigai K. Thangavel Ashok Kumar

A common challenge faced by many data clustering techniques is data complexity, which leads to issues such as overlapping, lack of representative data and class imbalance. This may degrade the clustering process. The situation gets worse when the class imbalance is very high. To cluster such imbalanced data sets, a better understanding of the dataset and efficient clu...

ForesTexter: An efficient random forest algorithm for imbalanced text categorization

Journal: Knowl.-Based Syst., 2014

Qingyao Wu Yunming Ye Haijun Zhang Michael K. Ng Shen-Shyang Ho

In this paper, we propose a new Random Forest (RF) based ensemble method, ForesTexter, to solve imbalanced text categorization problems. RF has shown great success in many real-world applications. However, learning from text data with class imbalance is a relatively new challenge that needs to be addressed. An RF algorithm tends to use a simple random sampling of features in b...
