Coping With New Challenges for Density-Based Clustering
Description
Knowledge Discovery in Databases (KDD) is the non-trivial process
of identifying valid, novel, potentially useful, and ultimately
understandable patterns in data. The core step of the KDD process
is the application of a data mining algorithm in order to produce
a particular enumeration of patterns and relationships in large
databases. Clustering is one of the major data mining tasks and
aims at grouping the data objects into meaningful classes
(clusters) such that the similarity of objects within a cluster is
maximized and the similarity of objects from different clusters is
minimized. Among many others, the density-based clustering notion
underlying the algorithm DBSCAN and its hierarchical extension
OPTICS has been proposed recently and has become one of the most
successful approaches to clustering. In this thesis, our aim is to
advance the state of the art in clustering, especially
density-based clustering, by identifying novel challenges for
density-based clustering and proposing innovative and solid
solutions for these challenges.
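
To make the density-based notion concrete, the following minimal
sketch implements the core idea behind DBSCAN in Python: a point is
a core point if its eps-neighborhood contains at least min_pts
objects, and a cluster is grown by repeatedly absorbing the
neighbors of core points. The parameter names, the Euclidean
distance, and the linear-scan neighborhood search are illustrative
assumptions, not the original implementation.

# Minimal DBSCAN-style sketch (illustrative, not the original implementation).
# A point is a core point if its eps-neighborhood holds at least min_pts
# objects; a cluster is grown by absorbing the neighbors of core points.
from math import dist

def region_query(points, i, eps):
    # Linear scan; a spatial index would be used in practice.
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:            # not a core point
            labels[i] = NOISE
            continue
        labels[i] = cluster_id                  # start a new cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:              # border point: claim it
                labels[j] = cluster_id
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:     # j is itself a core point
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels

Calling dbscan(points, eps=0.5, min_pts=5) on a list of coordinate
tuples returns one label per point, with -1 marking noise. OPTICS
generalizes this idea by ordering the points so that density-based
clusterings for all distance thresholds up to eps can be read off a
single reachability plot.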
We describe the development of the industrial prototype BOSS
(Browsing OPTICS plots for Similarity Search), which is a first
step towards a comprehensive, scalable, and distributed computing
solution designed to make the efficiency and analytical
capabilities of OPTICS available to a broader audience. The
development of BOSS requires several key enhancements of OPTICS,
which are addressed in this thesis. We develop incremental
algorithms for OPTICS to efficiently reconstruct the hierarchical
clustering structure in frequently updated databases, in particular
when a set of objects is inserted into or deleted from the
database. We empirically show that these incremental algorithms
yield significant speed-up factors over the original OPTICS
algorithm. Furthermore, we propose a novel algorithm for the
automatic extraction of clusters from hierarchical clustering
representations that outperforms comparable methods, and we
introduce two novel approaches for selecting meaningful cluster
representatives, which use the density-based concepts of OPTICS and
produce better results than the related medoid approach.
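
As an illustration of what cluster extraction from a hierarchical
clustering representation means, the sketch below applies the
classical ExtractDBSCAN-style cut to an OPTICS reachability plot: a
single threshold eps_prime slices the plot into flat clusters. This
baseline is not the automatic extraction algorithm proposed in the
thesis; it only shows the kind of representation such algorithms
operate on.

# Flat clustering from an OPTICS ordering by cutting the reachability plot
# at eps_prime (classical ExtractDBSCAN idea, shown for illustration only).
def extract_dbscan_clusters(reachability, core_dist, eps_prime):
    # reachability / core_dist: reachability and core distances of the
    # objects in OPTICS output order, float('inf') where undefined.
    NOISE = -1
    labels = [NOISE] * len(reachability)
    cluster_id = -1
    for i in range(len(reachability)):
        if reachability[i] > eps_prime:
            if core_dist[i] <= eps_prime:       # object starts a new cluster
                cluster_id += 1
                labels[i] = cluster_id
            # otherwise the object is noise at this cut level
        else:
            labels[i] = cluster_id              # continues the current cluster
    return labels

The appeal of an automatic extraction method is precisely that it
does not rely on a user-chosen global cut like eps_prime.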
Another major challenge for density-based clustering is coping with
high-dimensional data. Many of today's real-world data sets contain
a large number of measurements (features) for a single data object.
Usually, global feature reduction techniques cannot be applied to
these data sets. Thus, the task of feature selection must be
combined with and incorporated into the clustering process. In this
thesis, we present original extensions and enhancements of the
density-based clustering notion to cope with high-dimensional data.
In particular, we propose an algorithm called SUBCLU (density-based
SUBspace CLUstering) that extends DBSCAN to the problem of subspace
clustering. SUBCLU efficiently computes all clusters that would
have been found if DBSCAN were applied to every possible subspace
of the feature space. An experimental evaluation on real-world data
sets illustrates that SUBCLU is more effective than existing
subspace clustering algorithms because it is able to find clusters
of arbitrary size and shape and produces deterministic results.
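
The result that SUBCLU is stated to compute can be characterized by
the brute-force procedure below, which simply runs DBSCAN (reusing
the dbscan sketch above) in every axis-parallel subspace with the
same eps and min_pts. SUBCLU itself avoids enumerating all 2^d - 1
subspaces; the sketch only pins down the output semantics.

# Brute-force characterization of the subspace clustering result: run DBSCAN
# (the dbscan sketch above) in every axis-parallel subspace. SUBCLU computes
# the same clusters without enumerating all 2^d - 1 subspaces.
from itertools import combinations

def all_subspace_clusters(points, eps, min_pts):
    d = len(points[0])
    result = {}
    for k in range(1, d + 1):
        for dims in combinations(range(d), k):
            projected = [tuple(p[i] for i in dims) for p in points]
            result[dims] = dbscan(projected, eps, min_pts)
    return result

For even moderate dimensionality d this enumeration is infeasible,
which is exactly why an efficient subspace clustering algorithm is
needed.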
A semi-hierarchical extension of SUBCLU called RIS (Ranking
Interesting Subspaces) is proposed that does not compute the
subspace clusters directly but instead generates a list of
subspaces ranked by their clustering characteristics. A
hierarchical clustering algorithm can then be applied to these
interesting subspaces in order to compute a hierarchical (subspace)
clustering. A comparative evaluation of RIS and SUBCLU shows that
RIS in combination with OPTICS can achieve an information gain over
SUBCLU.
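
A hypothetical sketch of the ranking idea: score every subspace by
a simple density statistic, here the share of core points (an
illustrative placeholder, not the quality criterion actually used
by RIS), sort the subspaces by that score, and hand the top-ranked
ones to a hierarchical algorithm such as OPTICS.

# Illustrative subspace ranking: score each subspace by the share of core
# points it contains (a placeholder statistic, not the RIS quality measure),
# then return the subspaces ordered from best to worst.
from itertools import combinations

def core_point_ratio(points, eps, min_pts):
    # region_query from the DBSCAN sketch above
    core = sum(1 for i in range(len(points))
               if len(region_query(points, i, eps)) >= min_pts)
    return core / len(points)

def rank_subspaces(points, eps, min_pts):
    d = len(points[0])
    scored = []
    for k in range(1, d + 1):
        for dims in combinations(range(d), k):
            projected = [tuple(p[i] for i in dims) for p in points]
            scored.append((core_point_ratio(projected, eps, min_pts), dims))
    scored.sort(reverse=True)
    return [dims for score, dims in scored]

Running OPTICS only on the highest-ranked subspaces then yields the
hierarchical subspace clustering described above.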
In addition, we propose the algorithm 4C (Computing Correlation
Connected Clusters) that extends the concepts of DBSCAN to compute
density-based correlation clusters. 4C benefits from an innovative,
well-defined, and effective clustering model, outperforming related
approaches in terms of clustering quality on real-world data sets.
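
The simplified check below conveys the gist of a correlation core
point in the spirit of 4C, under the assumption that "locally
correlated" means the eps-neighborhood has at most lambda_ strong
eigenvalue directions in its covariance matrix. The delta threshold
for deciding which eigenvalues count as strong is an illustrative
parameter, and the full 4C model additionally employs a
correlation-aware distance when expanding clusters.

# Simplified correlation core point check in the spirit of 4C: the point must
# have a dense eps-neighborhood whose covariance matrix has at most lambda_
# "strong" eigenvalue directions. delta is an assumed relative threshold for
# deciding which eigenvalues count as strong.
import numpy as np

def is_correlation_core(points, i, eps, min_pts, lambda_, delta=0.01):
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points - points[i], axis=1)
    neighborhood = points[dists <= eps]
    if len(neighborhood) < min_pts:             # density condition fails
        return False
    cov = np.cov(neighborhood, rowvar=False)    # d x d covariance matrix
    eigvals = np.linalg.eigvalsh(cov)
    strong = np.sum(eigvals > delta * eigvals.max())
    return bool(strong <= lambda_)              # correlation condition

Cluster expansion then proceeds along the lines of DBSCAN, but
restricted to such correlation core points.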