Algorithms for Mining Distance-Based Outliers in Large Datasets
نویسندگان
چکیده
This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers in large datasets can only deal efficiently with two dimensions/attributes of a dataset. Here, we study the notion of DB(DistanceBased) outliers. While we provide formal and empirical evidence showing the usefulness of DB-outliers, we focus on the development of algorithms for computing such outliers. First, we present two simple algorithms, both having a complexity of O(k N’), k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we present an optimized cell-based algorithm that has a complexity that is linear wrt N, but exponential wrt k. Third, for datasets that are mainly disk-resident, we present another version of the cell-based algorithm that guarantees at most 3 passes over a dataset. We provide Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed fOT direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, OT to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 24th VLDB Conference New York, USA, 1998 experimental results showing that these cellbased algorithms are by far the best for k 5 4.
منابع مشابه
Assessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories
In recent years, the tremendous and increasing growth of spatial trajectory data and the necessity of processing and extraction of useful information and meaningful patterns have led to the fact that many researchers have been attracted to the field of spatio-temporal trajectory clustering. The process and analysis of these trajectories have resulted in the extraction of useful information whic...
متن کاملGrid-ODF: Detecting Outliers Effectively and Efficiently in Large Multi-dimensional Databases
Outlier detection is an important task in data mining that enjoys a wide range of applications such as detections of credit card fraud, criminal activity and exceptional patterns in databases. In recent years, there have been numerous research work in outlier detection and the new notions such as distance-based outliers and density-based local outliers have been proposed. However, the existing ...
متن کاملAn Effective Approach for Robust Metric Learning in the Presence of Label Noise
Many algorithms in machine learning, pattern recognition, and data mining are based on a similarity/distance measure. For example, the kNN classifier and clustering algorithms such as k-means require a similarity/distance function. Also, in Content-Based Information Retrieval (CBIR) systems, we need to rank the retrieved objects based on the similarity to the query. As generic measures such as ...
متن کاملIdentification of Structural Defects Using Computer Algorithms
One of the numerous methods recently employed to study the health of structures is the identification of anomaly in data obtained for the condition of the structure, e.g. the frequencies for the structural modes, stress, strain, displacement, speed, and acceleration) which are obtained and stored by various sensors. The methods of identification applied for anomalies attempt to discover and re...
متن کاملSome issues about outlier detection in rough set theory
‘‘One person’s noise is another person’s signal” (Knorr, E., Ng, R. (1998). Algorithms for mining distancebased outliers in large datasets. In Proceedings of the 24th VLDB conference, New York (pp. 392–403)). In recent years, much attention has been given to the problem of outlier detection, whose aim is to detect outliers – objects which behave in an unexpected way or have abnormal properties....
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 1998