Advanced Data Mining Techniques for Compound Objects
Description
20 years ago
Knowledge Discovery in Databases (KDD) is the non-trivial process
of identifying valid, novel, potentially useful, and ultimately
understandable patterns in large data collections. The most
important step within the KDD process is data mining, which is
concerned with extracting these patterns. KDD is
necessary to analyze the steadily growing amount of data made possible by
the enhanced performance of modern computer systems. However, as
the amount of data grows, the complexity of data objects increases
as well. Modern KDD methods should therefore handle objects more
complex than simple feature vectors to address real-world KDD
applications adequately. Multi-instance and multi-represented
objects are two important types of object representations for
complex objects. Multi-instance objects consist of a set of object
representations that all belong to the same feature space.
Multi-represented objects are constructed as a tuple of feature
representations where each feature representation belongs to a
different feature space.

The contribution of this thesis is the
development of new KDD methods for the classification and
clustering of complex objects. To this end, the thesis introduces
solutions for real-world applications that are based on
multi-instance and multi-represented object representations. On the
basis of these solutions, it is shown that a more general object
representation often provides better results for many relevant KDD
applications.

The first part of the thesis is concerned with two
KDD problems for which employing multi-instance objects provides
efficient and effective solutions. The first is data mining on
CAD parts, e.g. the use of hierarchical clustering for the automatic
construction of product hierarchies. The introduced solution
decomposes a single part into a set of feature vectors and compares
them by using a metric on multi-instance objects. Furthermore,
multi-step query processing using a novel filter step is employed,
enabling the user to efficiently process similarity queries. On the
basis of this similarity search system, it is possible to run
several distance-based data mining algorithms, such as the hierarchical
clustering algorithm OPTICS, to derive product hierarchies. The
second important application is the classification of and search for
complete websites on the World Wide Web (WWW). A website is a set
of HTML documents that is published by the same person, group, or
organization and usually serves a common purpose. To perform data
mining for websites, the thesis presents several methods to
classify websites. After introducing naive methods modelling
websites as webpages, two more sophisticated approaches to website
classification are introduced. The first approach uses a
preprocessing step that maps single HTML documents within each website
to so-called page classes. The second approach directly compares
websites as sets of word vectors and uses nearest-neighbor
classification. To search the WWW for new, relevant websites, a
focused crawler is introduced that efficiently retrieves relevant
websites. This crawler minimizes the number of HTML documents retrieved and
increases the accuracy of website retrieval.

The second part of the
thesis is concerned with data mining on multi-represented
objects. Proteins are an important example of this kind of complex
object: each can be represented as a tuple of a
protein sequence and a text annotation. To analyze
such objects, a clustering method is introduced that is based on the
density-based clustering algorithm DBSCAN. This method uses all
available representations to find a global clustering of
the given data objects. However, in many applications there already
exists a sophisticated class ontology for the given data objects,
e.g. for proteins. To map new objects into such an ontology, a new method for
the hierarchical classification of multi-represented objects is
described. The system exploits the hierarchical structure of the
ontology to classify new proteins efficiently using support vector
machines.
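
The idea of comparing multi-instance objects as sets of feature vectors and classifying them by their nearest neighbor can be illustrated with a minimal sketch. The abstract does not name the set metric used in the thesis, so the symmetric Hausdorff distance below is an assumption chosen only because it is a well-known metric on finite point sets. The function names, the toy vectors, and the labels "A" and "B" are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]
# A multi-instance object: several vectors that share one feature space.
MultiInstance = Sequence[Vector]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hausdorff(s: MultiInstance, t: MultiInstance) -> float:
    """Symmetric Hausdorff distance between two non-empty vector sets.

    Assumption: stands in for the (unnamed) metric on multi-instance
    objects mentioned in the abstract.
    """
    def directed(u: MultiInstance, v: MultiInstance) -> float:
        # For each vector in u, find its closest vector in v; keep the worst case.
        return max(min(euclidean(p, q) for q in v) for p in u)

    return max(directed(s, t), directed(t, s))


def nearest_neighbor_label(
    query: MultiInstance,
    training: List[Tuple[MultiInstance, str]],
) -> str:
    """Return the label of the training object closest to the query."""
    _, label = min(training, key=lambda item: hausdorff(query, item[0]))
    return label


# Hypothetical toy data: each object is a small set of 2-d vectors.
training = [
    ([(0.0, 0.0), (1.0, 0.0)], "A"),
    ([(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)], "B"),
]
query = [(0.5, 0.0), (0.0, 1.0)]

print(nearest_neighbor_label(query, training))
```

Because the distance is defined on whole sets, objects with different numbers of instances (two vectors versus three above) can be compared directly, which is the property that makes a set metric usable for distance-based methods on multi-instance data.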