imbalanced data sets

Machine Condition Monitoring and Fault Diagnostics with Imbalanced Data Sets based on the KDD Process

Journal: IFAC-PapersOnLine 2016

An experimental comparison of classification techniques for imbalanced credit scoring data sets using SAS® Enterprise Miner™

2012

Iain Brown

In this paper, we set out to compare several techniques that can be used in the analysis of imbalanced credit scoring data sets. In a credit scoring context, imbalanced data sets frequently occur as the number of defaulting loans in a portfolio is usually much lower than the number of observations that do not default. As well as using traditional classification techniques such as logistic regre...

Online Imbalanced Support Vector Machine for Phishing Emails Filtering

2014

XiaoQing Gu, TongGuang Ni, Wei Wang

Phishing emails are a real threat to internet communication and the web economy. In real-world email datasets, the data are predominantly composed of ham samples, with only a small percentage of phishing ones. A standard Support Vector Machine (SVM) can produce suboptimal results in filtering phishing emails, and it often requires much time to perform classification on large data sets. In this pap...

Classifying Imbalanced Data Sets by a Novel RE-Sample and Cost-Sensitive Stacked Generalization Method

Journal: Mathematical Problems in Engineering 2018

Applying Mondrian Cross-Conformal Prediction To Estimate Prediction Confidence on Large Imbalanced Bioactivity Data Sets

Journal: Journal of Chemical Information and Modeling 2017

An experimental comparison of classification algorithms for imbalanced credit scoring data sets

Journal: Expert Syst. Appl. 2012

Iain Brown, Christophe Mues

In this paper, we set out to compare several techniques that can be used in the analysis of imbalanced credit scoring data sets. In a credit scoring context, imbalanced data sets frequently occur as the number of defaulting loans in a portfolio is usually much lower than the number of observations that do not default. As well as using traditional classification techniques such as logistic regre...

Kernel Based Asymmetric Learning for Software Defect Prediction

Journal: IEICE Transactions 2012

Ying Ma, Guangchun Luo, Hao Chen

Software defect prediction aims to predict the defect-prone modules for the next release of a software system or for cross-project software. Real-world data mining applications, including the software defect prediction domain, must address the issue of learning from imbalanced data sets. As pointed out by Khoshgoftaar et al. [1] and Menzies et al. [2], the majority of defects in a software system are located in a...

Learning from Skewed Class Multi-relational Databases

Journal: Fundam. Inform. 2008

Hongyu Guo, Herna L. Viktor

Relational databases, with vast amounts of data (from financial transactions, marketing surveys, and medical records to health informatics observations) and complex schemas, are ubiquitous in our society. Multirelational classification algorithms have been proposed to learn from such relational repositories, where multiple interconnected tables (relations) are involved. These methods search for rel...

SMOTEBoost: Improving Prediction of the Minority Class in Boosting

2003

Nitesh V. Chawla, Aleksandar Lazarevic, Lawrence O. Hall, Kevin W. Bowyer

Many real world data mining applications involve learning from imbalanced data sets. Learning from data sets that contain very few instances of the minority (or interesting) class usually produces biased classifiers that have a higher predictive accuracy over the majority class(es), but poorer predictive accuracy over the minority class. SMOTE (Synthetic Minority Over-sampling TEchnique) is spe...
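The abstract above is cut off before it describes SMOTE in detail. As a rough illustration of the general idea behind SMOTE-style over-sampling (not of SMOTEBoost itself, and not taken from the paper), the sketch below creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, its parameters, and the choice to pick base samples at random are assumptions made for this example.

```python
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Hypothetical helper: return n_synthetic interpolated minority samples.

    minority is a list of equal-length numeric feature lists.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        # Simplification: choose the base sample at random.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sq_dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Place the new point somewhere on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because each synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class instead of duplicating existing rows.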

A Genetic Algorithm for Feature Selection and Granularity Learning in Fuzzy Rule-Based Classification Systems for Highly Imbalanced Data-Sets

2010

Pedro Villar, Alberto Fernández, Francisco Herrera

This contribution proposes a Genetic Algorithm for jointly performing feature selection and granularity learning for Fuzzy Rule-Based Classification Systems in the scenario of data-sets with a high degree of imbalance. We refer to imbalanced data-sets when the class distribution is not uniform, a situation that is present in many real application areas. The aim of this work is to get more comp...
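One common way to put a number on "a high degree of imbalance" is the imbalance ratio: the size of the largest class divided by the size of the smallest. The sketch below is only an illustration of that measure; the abstract does not say which measure the authors use, and the function name `imbalance_ratio` is hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Hypothetical helper: largest class count divided by smallest class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    return max(counts.values()) / min(counts.values())


# Example: 90 majority labels and 10 minority labels.
ratio = imbalance_ratio([0] * 90 + [1] * 10)
```

A ratio of 1 corresponds to a perfectly uniform class distribution; larger values indicate a more skewed data-set.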
