Efficient Knowledge Extraction from Structured Data
Description
12 years ago
Knowledge extraction from structured data aims to identify
valid, novel, potentially useful, and ultimately understandable
patterns in the data. The core step of this process is the
application of a data mining algorithm in order to produce an
enumeration of particular patterns and relationships in large
databases. Clustering is one of the major data mining tasks and
aims at grouping the data objects into meaningful classes
(clusters) such that the similarity of objects within clusters is
maximized, and the similarity of objects from different clusters is
minimized. In this thesis, we advance state-of-the-art data
mining algorithms for analyzing structured data types. We describe
the development of innovative solutions for hierarchical data
mining. The EM-based hierarchical clustering method ITCH
(Information-Theoretic Cluster Hierarchies) is designed to address
four challenges: (1) guiding the hierarchical clustering algorithm
to identify only meaningful and valid clusters; (2) representing
the content of each cluster in the hierarchy by an intuitive
description, e.g. a probability density function; (3) handling
outliers consistently; and (4) avoiding difficult parameter
settings. ITCH is built on a hierarchical
variant of the information-theoretic principle of Minimum
Description Length (MDL). When the hierarchical cluster structure
is interpreted as a statistical model of the dataset, it can be
used for effective data compression by Huffman coding. Thus, the
achievable compression rate induces a natural objective function
for clustering, which automatically satisfies all four
above-mentioned goals. The genetic-based hierarchical clustering
algorithm GACH
(Genetic Algorithm for finding Cluster Hierarchies) overcomes the
problem of getting stuck in a local optimum by a beneficial
combination of genetic algorithms, information theory and
model-based clustering. Besides hierarchical data mining, we also
made contributions to more complex data structures, namely objects
that consist of mixed-type attributes and skyline objects. The
algorithm INTEGRATE performs integrative mining of heterogeneous
data, one of the major challenges of the next decade, through a
unified view of numerical and categorical information in
clustering. Again supported by the MDL principle, INTEGRATE
guarantees usability on real-world data. For skyline objects we
developed SkyDist, a similarity measure for comparing different
skyline objects and thus a first step towards performing data
mining on this kind of data structure. Applied in a recommender
system, for example, SkyDist can point the user to alternative car
types that exhibit price/mileage behavior similar to that of the
original query. For mining
graph-structured data, we developed several approaches that detect
patterns in both static and dynamic networks. We confirmed the
practical feasibility of our novel
approaches on large real-world case studies ranging from medical
brain data to biological yeast networks. In the second part of this
thesis, we focused on accelerating the knowledge extraction
process. We achieved this objective through the adoption of
Graphics Processing Units (GPUs). GPUs have evolved from simple
devices for preparing the display signal into powerful coprocessors
that not only support typical computer graphics tasks but can also
be used for general numeric and symbolic computations. As a major
advantage, GPUs provide extreme parallelism combined with high
memory bandwidth at low cost. In this thesis, we
propose algorithms, called CUDA-DClust and CUDA-k-means, for
computationally expensive data mining tasks such as similarity
search and different clustering paradigms; they are designed for
the highly parallel environment of a GPU. We define a
multi-dimensional index
structure which is particularly suited to support similarity
queries under the restricted programming model of a GPU. We
demonstrate the superiority of our algorithms running on GPU over
their conventional counterparts on CPU in terms of efficiency.
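
The idea that the achievable compression rate can serve as a clustering objective can be illustrated with a minimal sketch. It is not the ITCH coding scheme: it assumes one-dimensional data, one Gaussian per cluster, ideal code lengths of -log2 p(x) instead of an actual Huffman code, and a simple (illustrative) cost for cluster IDs and parameters. The function names `gaussian_bits` and `description_length` are hypothetical.

```python
import math


def gaussian_bits(points):
    """Ideal code length (bits) of points under a Gaussian fitted to them.

    Densities are used instead of probabilities, so the value is only
    meaningful up to a constant that depends on the data resolution
    and can be negative for tight clusters.
    """
    n = len(points)
    mean = sum(points) / n
    # Floor the variance so degenerate clusters do not cause log(0).
    var = max(sum((x - mean) ** 2 for x in points) / n, 1e-9)
    log2_e = math.log2(math.e)
    return sum(
        0.5 * math.log2(2 * math.pi * var)
        + ((x - mean) ** 2) / (2 * var) * log2_e
        for x in points
    )


def description_length(clusters):
    """Two-part MDL-style cost of a flat clustering (assumed cost model)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        # Cost of telling, for each point, which cluster it belongs to.
        id_bits = -len(c) * math.log2(len(c) / n)
        # Assumed parameter cost: two parameters (mean, variance),
        # each charged 0.5 * log2(cluster size) bits.
        param_bits = 2 * 0.5 * math.log2(len(c))
        total += id_bits + param_bits + gaussian_bits(c)
    return total


# Two well-separated groups of toy values (made up for illustration).
left = [1.0, 1.2, 0.8, 1.1, 0.9]
right = [9.0, 9.2, 8.8, 9.1, 8.9]

one_cluster = description_length([left + right])
two_clusters = description_length([left, right])
print(f"one cluster:  {one_cluster:.2f} bits")
print(f"two clusters: {two_clusters:.2f} bits")
```

On this toy data the two-cluster model should yield the shorter description, because the savings from coding each point under a tight Gaussian outweigh the extra bits spent on cluster IDs and parameters. A cost of this kind needs no user-supplied number of clusters, which is the sense in which a compression-based objective avoids difficult parameter settings.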