New techniques for clustering complex objects
Abstract
The tremendous amount of data produced nowadays in application domains such as molecular biology or geography can only be fully exploited by efficient and effective data mining tools. One of the primary data mining tasks is clustering: partitioning the points of a data set into distinct groups (clusters) such that two points from the same cluster are similar to each other, whereas two points from distinct clusters are not. Thanks to modern database technology, e.g. object-relational databases, a huge number of complex objects from scientific, engineering or multimedia applications are stored in database systems. Modelling such complex data often results in very high-dimensional vector data ("feature vectors"). In the context of clustering, this causes many fundamental problems, commonly subsumed under the term "Curse of Dimensionality". As a result, traditional clustering algorithms often fail to generate meaningful results, because in such high-dimensional feature spaces the data no longer forms clusters. Usually, however, there are clusters embedded in lower-dimensional subspaces, i.e. meaningful clusters can be found if only a certain subset of features is considered for clustering. This subset of features may even differ from cluster to cluster. In this thesis, we present original extensions and enhancements of the density-based clustering notion to cope with high-dimensional data. In particular, we propose an algorithm called SUBCLU (density-connected Subspace Clustering) that extends DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to the problem of subspace clustering. SUBCLU efficiently computes all clusters of arbitrary shape and size that would have been found if DBSCAN were applied to all possible subspaces.
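To make the stated goal concrete, the following is a minimal sketch of the brute-force baseline the abstract refers to: running DBSCAN on every non-empty subspace and collecting the clusters found in each. It is not SUBCLU itself, and it makes no attempt at the efficiency the abstract claims for SUBCLU. The function names `dbscan` and `cluster_all_subspaces`, the toy data, and the parameter values `eps` and `min_pts` are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import dist


def dbscan(points, eps, min_pts):
    """Minimal textbook DBSCAN; returns one label per point (-1 = noise)."""
    n = len(points)
    # eps-neighbourhood of every point (each point is its own neighbour)
    neighbours = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        labels[i] = cluster_id  # i is a core point: start a new cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reached from a core point: border
            if labels[j] is not None:
                continue  # already assigned
            labels[j] = cluster_id
            if len(neighbours[j]) >= min_pts:
                frontier.extend(neighbours[j])  # expand only through core points
        cluster_id += 1
    return labels


def cluster_all_subspaces(data, eps, min_pts):
    """Brute force (illustrative only): run DBSCAN in each of the 2^d - 1 subspaces.

    Returns {subspace (tuple of dimension indices): list of clusters},
    where each cluster is a list of point indices. Subspaces without
    clusters are omitted.
    """
    d = len(data[0])
    result = {}
    for k in range(1, d + 1):
        for subspace in combinations(range(d), k):
            projected = [tuple(p[dim] for dim in subspace) for p in data]
            labels = dbscan(projected, eps, min_pts)
            clusters = {}
            for idx, label in enumerate(labels):
                if label != -1:
                    clusters.setdefault(label, []).append(idx)
            if clusters:
                result[subspace] = list(clusters.values())
    return result


# Toy data (made up): two tight groups in dimension 0, widely spread in dimension 1.
data = [(1.0, 0.0), (1.1, 10.0), (0.9, 20.0), (1.05, 30.0),
        (5.0, 40.0), (5.1, 50.0), (4.9, 60.0), (5.05, 70.0)]
found = cluster_all_subspaces(data, eps=0.5, min_pts=3)
for subspace, clusters in found.items():
    print(subspace, clusters)
```

On this toy data the full two-dimensional space contains only noise, while the one-dimensional subspace spanned by the first feature contains two clusters, which is the situation the abstract describes. The number of subspaces grows exponentially with the dimensionality, which is why an exhaustive search like this one does not scale and an efficient method is needed.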
Similar resources
A Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach
In recent years, the advancement of information-gathering technologies such as GPS and GSM networks has led to huge, complex datasets such as time series and trajectories. As a result, it is essential to use appropriate methods to analyze the large raw datasets produced. Extracting useful information from large data sets has always been one of the most important challenges in different sciences,...
Hierarchical Clustering in Object Oriented Data Models with Complex Class Relationships
Class fragmentation is an essential phase in the design of Distributed Object Oriented Databases (DOODB). Horizontal and vertical fragmentation are the two commonly used fragmentation techniques. We propose here two new methods for horizontal fragmentation of objects with complex attributes. They rely on AI clustering techniques for grouping objects into fragments. Both methods take into accoun...
Comparing Model-based Versus K-means Clustering for the Planar Shapes
In some fields, there is an interest in distinguishing different geometrical objects from each other. The field of research that studies such objects from a statistical point of view, provided they are invariant under translation, rotation and scaling, is known as statistical shape analysis. Having some objects that are registered using key points on the outline...
Clustering and Classification on Uncertain Data
We study the problem of mining uncertain objects whose locations are uncertain and described by probability density functions (pdf). Clustering and classification are two important tasks in data mining. Clustering uncertain objects differs from the traditional case of certain objects. UK-means has been proposed based on K-means, but it is time-consuming. Pruning techniques are proposed to impro...
Robust Method for E-Maximization and Hierarchical Clustering of Image Classification
We developed a new semi-supervised EM-like algorithm that is given the set of objects present in each training image, but does not know which regions correspond to which objects. We have tested the algorithm on a dataset of 860 hand-labeled color images using only color and texture features, and the results show that our EM variant is able to break the symmetry in the initial solution. We compared...
Selecting Ensemble Members in Ensemble Clustering Using Voting
Clustering is the process of dividing a dataset into subsets called clusters, so that objects within a cluster are similar to each other and different from objects in other clusters. So far, many algorithms following different approaches have been created for clustering. An effective choice can be to combine two or more of these algorithms to solve the clustering problem. Ensemb...
Publication date: 2004