Correlation Clustering
Description
16 years ago
Knowledge Discovery in Databases (KDD) is the non-trivial process
of identifying valid, novel, potentially useful, and ultimately
understandable patterns in data. The core step of the KDD process
is the application of a Data Mining algorithm in order to produce a
particular enumeration of patterns and relationships in large
databases. Clustering is one of the major data mining techniques
and aims at grouping the data objects into meaningful classes
(clusters) such that the similarity of objects within clusters is
maximized, and the similarity of objects from different clusters is
minimized. This can serve to group customers with similar
interests, or to group genes with related functionalities.
High-dimensional feature spaces currently pose a particular
challenge for clustering techniques. Owing to modern data
collection facilities, real data sets usually contain many
features. These features are often noisy or correlated with one
another. However, because the relevance of these effects differs
between parts of the data set, irrelevant features cannot be
discarded in advance. The selection of relevant features must
therefore be integrated into the data mining technique. For about
ten years, specialized clustering approaches have been developed
to cope with problems in high-dimensional data better than classic
clustering approaches. Often, however, problems of very different
nature are not distinguished from one another. A main
objective of this thesis is therefore a systematic classification
of the diverse approaches developed in recent years according to
their task definition, their basic strategy, and their algorithmic
approach. We distinguish three main categories: the search for
clusters (i) with respect to closeness of objects in axis-parallel
subspaces, (ii) with respect to common behavior (patterns) of
objects in axis-parallel subspaces, and (iii) with respect to
closeness of objects in arbitrarily oriented subspaces (so-called
correlation clusters). For the third category,
the remaining parts of the thesis describe novel approaches. A
first approach is the adaptation of density-based clustering to the
problem of correlation clustering. The starting point here is the
first density-based approach in this field, the algorithm 4C.
Subsequently, enhancements and variations of this approach are
discussed that behave more robustly, more efficiently, or more
effectively, or that even find hierarchies of correlation clusters
and the corresponding subspaces. The density-based approach to
correlation clustering, however, is fundamentally unable to solve
some issues because it requires an analysis of local
neighborhoods, which is problematic in high-dimensional data.
Therefore, a novel method is proposed that tackles the correlation
clustering problem with a global approach. In addition, a method
is proposed to derive models for
correlation clusters to allow for an interpretation of the clusters
and facilitate more thorough analysis in the corresponding domain
science. Finally, possible applications of these models are
proposed and discussed.
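
To make the notion of a correlation cluster and of a derived cluster
model more concrete, the following is a minimal sketch and not the
method of the thesis. It assumes synthetic 2-dimensional points
scattered around the line y = 2x + 1, applies a principal component
analysis to the cluster, and reads a linear model off the weak
eigenvector. The data, the function names, and the 1% threshold for
calling an eigenvalue "weak" are assumptions made for illustration.

```python
import math
import random


def mean_and_covariance(points):
    """Return the centroid and the entries (sxx, sxy, syy) of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), (sxx, sxy, syy)


def eigen_symmetric_2x2(sxx, sxy, syy):
    """Eigenvalues (strong, weak) and unit eigenvectors of [[sxx, sxy], [sxy, syy]]."""
    center = (sxx + syy) / 2.0
    radius = math.hypot((sxx - syy) / 2.0, sxy)
    # Orientation of the principal axis (direction of largest variance).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    strong = (math.cos(theta), math.sin(theta))
    weak = (-math.sin(theta), math.cos(theta))  # orthogonal to the strong axis
    return (center + radius, center - radius), (strong, weak)


# Synthetic cluster (assumption): points close to the line y = 2x + 1.
random.seed(0)
points = []
for _ in range(200):
    x = random.uniform(0.0, 10.0)
    points.append((x, 2.0 * x + 1.0 + random.gauss(0.0, 0.05)))

(mx, my), (sxx, sxy, syy) = mean_and_covariance(points)
(lam_strong, lam_weak), (strong, weak) = eigen_symmetric_2x2(sxx, sxy, syy)

# Hypothetical rule: an eigenvalue carrying under 1% of the variance is "weak".
weak_share = lam_weak / (lam_strong + lam_weak)
correlation_dim = 1 if weak_share < 0.01 else 2

# The weak eigenvector is the normal of the fitted line:
# weak . (p - centroid) = 0, rewritten as y = slope * x + intercept.
slope = -weak[0] / weak[1]
intercept = my - slope * mx

print("correlation dimensionality:", correlation_dim)
print("model: y = %.3f * x + %.3f" % (slope, intercept))
```

In this sketch the cluster occupies a 1-dimensional, arbitrarily
oriented subspace of the 2-dimensional feature space, and the
recovered line equation is the kind of quantitative description that
could support interpretation in a domain science.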