New Techniques for Clustering Complex Objects
Beschreibung
vor 20 Jahren
The tremendous amount of data produced nowadays in various
application domains such as molecular biology or geography can only
be fully exploited by efficient and effective data mining tools.
One of the primary data mining tasks is clustering, which is the
task of partitioning points of a data set into distinct groups
(clusters) such that two points from one cluster are similar to
each other whereas two points from distinct clusters are not. Due
to modern database technology, e.g.object relational databases, a
huge amount of complex objects from scientific, engineering or
multimedia applications is stored in database systems. Modelling
such complex data often results in very high-dimensional vector
data ("feature vectors"). In the context of clustering, this causes
a lot of fundamental problems, commonly subsumed under the term
"Curse of Dimensionality". As a result, traditional clustering
algorithms often fail to generate meaningful results, because in
such high-dimensional feature spaces data does not cluster anymore.
But usually, there are clusters embedded in lower dimensional
subspaces, i.e. meaningful clusters can be found if only a certain
subset of features is regarded for clustering. The subset of
features may even be different for varying clusters. In this
thesis, we present original extensions and enhancements of the
density-based clustering notion to cope with high-dimensional data.
In particular, we propose an algorithm called SUBCLU
(density-connected Subspace Clustering) that extends DBSCAN
(Density-Based Spatial Clustering of Applications with Noise) to
the problem of subspace clustering. SUBCLU efficiently computes all
clusters of arbitrary shape and size that would have been found if
DBSCAN were applied to all possible subspaces of the feature space.
Two subspace selection techniques called RIS (Ranking Interesting
Subspaces) and SURFING (SUbspaces Relevant For clusterING) are
proposed. They do not compute the subspace clusters directly, but
generate a list of subspaces ranked by their clustering
characteristics. A hierarchical clustering algorithm can be applied
to these interesting subspaces in order to compute a hierarchical
(subspace) clustering. In addition, we propose the algorithm 4C
(Computing Correlation Connected Clusters) that extends the
concepts of DBSCAN to compute density-based correlation clusters.
4C searches for groups of objects which exhibit an arbitrary but
uniform correlation. Often, the traditional approach of modelling
data as high-dimensional feature vectors is no longer able to
capture the intuitive notion of similarity between complex objects.
Thus, objects like chemical compounds, CAD drawings, XML data or
color images are often modelled by using more complex
representations like graphs or trees. If a metric distance function
like the edit distance for graphs and trees is used as similarity
measure, traditional clustering approaches like density-based
clustering are applicable to those data. However, we face the
problem that a single distance calculation can be very expensive.
As clustering performs a lot of distance calculations, approaches
like filter and refinement and metric indices get important. The
second part of this thesis deals with special approaches for
clustering in application domains with complex similarity models.
We show, how appropriate filters can be used to enhance the
performance of query processing and, thus, clustering of
hierarchical objects. Furthermore, we describe how the two
paradigms of filtering and metric indexing can be combined. As
complex objects can often be represented by using different
similarity models, a new clustering approach is presented that is
able to cluster objects that provide several different complex
representations.
application domains such as molecular biology or geography can only
be fully exploited by efficient and effective data mining tools.
One of the primary data mining tasks is clustering, which is the
task of partitioning points of a data set into distinct groups
(clusters) such that two points from one cluster are similar to
each other whereas two points from distinct clusters are not. Due
to modern database technology, e.g.object relational databases, a
huge amount of complex objects from scientific, engineering or
multimedia applications is stored in database systems. Modelling
such complex data often results in very high-dimensional vector
data ("feature vectors"). In the context of clustering, this causes
a lot of fundamental problems, commonly subsumed under the term
"Curse of Dimensionality". As a result, traditional clustering
algorithms often fail to generate meaningful results, because in
such high-dimensional feature spaces data does not cluster anymore.
But usually, there are clusters embedded in lower dimensional
subspaces, i.e. meaningful clusters can be found if only a certain
subset of features is regarded for clustering. The subset of
features may even be different for varying clusters. In this
thesis, we present original extensions and enhancements of the
density-based clustering notion to cope with high-dimensional data.
In particular, we propose an algorithm called SUBCLU
(density-connected Subspace Clustering) that extends DBSCAN
(Density-Based Spatial Clustering of Applications with Noise) to
the problem of subspace clustering. SUBCLU efficiently computes all
clusters of arbitrary shape and size that would have been found if
DBSCAN were applied to all possible subspaces of the feature space.
Two subspace selection techniques called RIS (Ranking Interesting
Subspaces) and SURFING (SUbspaces Relevant For clusterING) are
proposed. They do not compute the subspace clusters directly, but
generate a list of subspaces ranked by their clustering
characteristics. A hierarchical clustering algorithm can be applied
to these interesting subspaces in order to compute a hierarchical
(subspace) clustering. In addition, we propose the algorithm 4C
(Computing Correlation Connected Clusters) that extends the
concepts of DBSCAN to compute density-based correlation clusters.
4C searches for groups of objects which exhibit an arbitrary but
uniform correlation. Often, the traditional approach of modelling
data as high-dimensional feature vectors is no longer able to
capture the intuitive notion of similarity between complex objects.
Thus, objects like chemical compounds, CAD drawings, XML data or
color images are often modelled by using more complex
representations like graphs or trees. If a metric distance function
like the edit distance for graphs and trees is used as similarity
measure, traditional clustering approaches like density-based
clustering are applicable to those data. However, we face the
problem that a single distance calculation can be very expensive.
As clustering performs a lot of distance calculations, approaches
like filter and refinement and metric indices get important. The
second part of this thesis deals with special approaches for
clustering in application domains with complex similarity models.
We show, how appropriate filters can be used to enhance the
performance of query processing and, thus, clustering of
hierarchical objects. Furthermore, we describe how the two
paradigms of filtering and metric indexing can be combined. As
complex objects can often be represented by using different
similarity models, a new clustering approach is presented that is
able to cluster objects that provide several different complex
representations.
Weitere Episoden
vor 11 Jahren
vor 11 Jahren
vor 11 Jahren
In Podcasts werben
Kommentare (0)