Integer Programming Relaxations for Integrated Clustering and Outlier Detection
نویسندگان
چکیده
In this paper we present methods for exemplar based clustering with outlier selection based on the facility location formulation. Given a distance function and the number of outliers to be found, the methods automatically determine the number of clusters and outliers. We formulate the problem as an integer program to which we present relaxations that allow for solutions that scale to large data sets. The advantages of combining clustering and outlier selection include: (i) the resulting clusters tend to be compact and semantically coherent (ii) the clusters are more robust against data perturbations and (iii) the outliers are contextualised by the clusters and more interpretable, i.e. it is easier to distinguish between outliers which are the result of data errors from those that may be indicative of a new pattern emergent in the data. We present and contrast three relaxations to the integer program formulation: (i) a linear programming formulation (LP) (ii) an extension of affinity propagation to outlier detection (APOC) and (iii) a Lagrangian duality based formulation (LD). Evaluation on synthetic as well as real data shows the quality and scalability of these different methods.
منابع مشابه
Size Matters: Cardinality-Constrained Clustering and Outlier Detection via Conic Optimization
Plain vanilla K-means clustering is prone to produce unbalanced clusters and suffers from outlier sensitivity. To mitigate both shortcomings, we formulate a joint outlier-detection and clustering problem, which assigns a prescribed number of datapoints to an auxiliary outlier cluster and performs cardinality-constrained K-means clustering on the residual dataset. We cast this problem as a mixed...
متن کاملExact Mixed Integer Programming for Integrated Scheduling and Process Planning in Flexible Environment
This paper presented a mixed integer programming for integrated scheduling and process planning. The presented process plan included some orders with precedence relations similar to Multiple Traveling Salesman Problem (MTSP), which was categorized as an NP-hard problem. These types of problems are also called advanced planning because of simultaneously determining the appropriate sequence and m...
متن کاملOutlier Detection for Polynomial Systems Using Semidefinite Relaxations
Outlier detection and analysis is a primary step in modelling towards obtaining unbiased estimates, model validation, and coherent analysis, because outliers may contain valuable information or lead to falsely rejecting hypotheses. In this work, we describe approaches for detecting outliers in measurements due to time-dependent and possibly nonhomogeneously distributed measurement uncertainties...
متن کاملOutlier Detection Using Extreme Learning Machines Based on Quantum Fuzzy C-Means
One of the most important concerns of a data miner is always to have accurate and error-free data. Data that does not contain human errors and whose records are full and contain correct data. In this paper, a new learning model based on an extreme learning machine neural network is proposed for outlier detection. The function of neural networks depends on various parameters such as the structur...
متن کاملOutlier Detection in Wireless Sensor Networks Using Distributed Principal Component Analysis
Detecting anomalies is an important challenge for intrusion detection and fault diagnosis in wireless sensor networks (WSNs). To address the problem of outlier detection in wireless sensor networks, in this paper we present a PCA-based centralized approach and a DPCA-based distributed energy-efficient approach for detecting outliers in sensed data in a WSN. The outliers in sensed data can be ca...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1403.1329 شماره
صفحات -
تاریخ انتشار 2014