Beschreibung

vor 21 Jahren
Modern database applications are characterized by two major
aspects: the use of complex data types with internal structure and
the need for new data analysis methods. The focus of database users
has shifted from simple queries to complex analyses of the data,
known as knowledge discovery in databases. Important tasks in this
area are the grouping of data objects (clustering), the
classification of new data objects or the detection of exceptional
data objects (outlier detection). Most algorithms for solving those
problems are based on similarity search in databases. This makes
efficient similarity search in large databases of structured
objects an important basic operation for modern database
applications. In this thesis we develop efficient methods for
similarity search in large databases of structured data and improve
the efficiency of existing query processing techniques. For the
data objects, only a tree or graph structure is assumed which can
be extended with arbitrary attribute information. Starting with an
analysis of the demands from two example applications, several
important requirements for similarity measures are identified. One
aspect is the adaptability of the similarity search method to the
requirements of the user and the application domain. This can even
imply a change of the similarity measure between two successive
queries of the same user. An explanation component which makes
clear why objects are considered similar by the system is a
necessary precondition for a purposeful adaption of the measure.
Consequently, the edit distance, well-known from string processing,
is a common similarity measure for graph structured objects. Its
feature to allow a visualization of corresponding substructures and
the possibility to weight single operations are the reason for this
popularity. But it turns out that the edit distance and similar
measures for tree structures are computationally extremely complex
which makes them unsuitable for today's large and even growing
databases. Therefore, we develop a multi-step query processing
architecture which reduces the number of necessary distance
calculations significantly. This is achieved by employing suitable
filter methods. Furthermore, we show that by easing certain
restrictions on the similarity measure, a significant performance
gain can be obtained without reducing the quality of the measure.
To achieve this, matchings of substructures (vertices or edges) of
the data objects are determined. An additional cost function for
those matchings allows to derive a similarity measure for
structured data, called the edge matching distance, from the cost
optimal matching of the substructures. But even for this new
similarity measure, efficiency can be improved significantly by
using a multi-step query processing approach. This allows the use
of the edge matching distance for knowledge discovery applications
in large databases. Within the thesis, the properties of our new
similarity search methods are proved both theoretically and through
experiments.

Kommentare (0)

Lade Inhalte...

Abonnenten

15
15
:
: