Description

13 years ago
Recent years have seen a tremendous increase in data acquisition
in scientific fields such as molecular biology,
bioinformatics and biomedicine. Therefore, novel methods are needed
for automatic data processing and analysis of this large amount of
data. Data mining is the process of applying methods like
clustering or classification to large databases in order to uncover
hidden patterns. Clustering is the task of partitioning the points of a
data set into distinct groups so as to maximize the intra-cluster
similarity and to minimize the inter-cluster similarity. In
contrast to unsupervised learning such as clustering,
classification is a supervised learning task that aims at
predicting the group membership of data objects on the basis of
rules learned from a training set in which the group membership is
known. Specialized methods have been proposed for hierarchical and
partitioning clustering. However, these methods suffer from several
drawbacks. In the first part of this work, new clustering methods
are proposed that cope with problems of conventional clustering
algorithms. ITCH (Information-Theoretic Cluster Hierarchies) is a
hierarchical clustering method that is based on a hierarchical
variant of the Minimum Description Length (MDL) principle which
finds hierarchies of clusters without requiring input parameters.
As ITCH may converge only to a local optimum, we propose GACH
(Genetic Algorithm for Finding Cluster Hierarchies), which combines
the benefits of genetic algorithms with information theory. In
this way the search space is explored more effectively.
Furthermore, we propose INTEGRATE, a novel clustering method for
data with mixed numerical and categorical attributes. Supported by
the MDL principle, our method integrates the information provided by
heterogeneous numerical and categorical attributes and thus
naturally balances the influence of both sources of information. A
comparative evaluation illustrates that INTEGRATE is more effective
than existing clustering methods for mixed-type data. Besides
clustering methods for single data objects, we provide a solution
for clustering different data sets that are represented by their
skylines. The skyline operator is a well-established database
primitive for finding database objects which minimize two or more
attributes with an unknown weighting between these attributes. In
this thesis, we define a similarity measure, called SkyDist, for
comparing skylines of different data sets that can directly be
integrated into different data mining tasks such as clustering or
classification. The experiments show that SkyDist in combination
with different clustering algorithms can give useful insights into
many applications. In the second part, we focus on the analysis of
high-resolution magnetic resonance images (MRI) that are clinically
relevant and may allow for an early detection and diagnosis of
several diseases. In particular, we propose a framework for the
classification of Alzheimer's disease in MR images combining the
data mining steps of feature selection, clustering and
classification. As a result, a set of highly selective features
discriminating patients with Alzheimer's disease from healthy subjects
has been identified. However, the analysis of the high-dimensional MR
images is extremely time-consuming. Therefore, we developed JGrid, a
scalable distributed computing solution designed to allow for a
large-scale analysis of MRI data and thus an optimized prediction of
diagnosis. In another study, we apply efficient algorithms for motif
discovery to task-fMRI scans in order to identify patterns in the
brain that are characteristic of patients with somatoform pain
disorder. We find groups of brain compartments that occur
frequently within the brain networks and discriminate well between
healthy and diseased people.
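
The skyline operator mentioned in the abstract can be illustrated with a
minimal sketch. It assumes that smaller values are better in every
attribute and uses a naive quadratic scan. The function names and the
hotel data (price, distance to the beach) are hypothetical and chosen
only for illustration. This is not the SkyDist measure itself, which the
abstract does not specify in detail.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b in every attribute and
    strictly better in at least one (assumption: smaller is better)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: Sequence[Point]) -> List[Point]:
    """Return all points that are not dominated by any other point.
    Naive O(n^2) scan, sufficient for a small illustration."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical data: (price, distance to the beach), both to be minimized.
hotels = [(50.0, 8.0), (80.0, 2.0), (60.0, 9.0), (120.0, 1.0), (90.0, 3.0)]
print(skyline(hotels))
```

Here (60.0, 9.0) is dominated by (50.0, 8.0) and (90.0, 3.0) by
(80.0, 2.0), so the skyline consists of the remaining three points. Each
of them is optimal for some weighting of the two attributes or is at
least not beaten in both attributes by any other point.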
