Similarity search and data mining techniques for advanced database systems.
Description
17 years ago
Modern automated methods for measurement, collection, and analysis
of data in industry and science are providing more and more data
with drastically increasing structural complexity. On the one hand,
this growing complexity is driven by the need for a richer and
more precise description of real-world objects; on the other hand,
it is driven by the rapid progress in measurement and analysis
techniques that allow the user a versatile exploration of objects.
In order to manage the huge volume of such complex data, advanced
database systems are employed. In contrast to conventional database
systems that support exact match queries, the user of these
advanced database systems focuses on applying similarity search and
data mining techniques. Based on an analysis of typical advanced
database systems (such as biometric, biological, multimedia,
moving-object, and CAD-object database systems), the following three
challenging characteristics of complexity are identified: uncertainty
(probabilistic feature vectors), multiple instances (a set of
homogeneous feature vectors), and multiple representations (a set
of heterogeneous feature vectors). Therefore, the goal of this
thesis is to develop similarity search and data mining techniques
that are capable of handling uncertain, multi-instance, and
multi-represented objects. The first part of this thesis deals with
similarity search techniques. Object identification is a similarity
search technique that is typically used for the recognition of
objects from image, video, or audio data. Thus, we develop a novel
probabilistic model for object identification. Based on it, two
novel types of identification queries are defined. In order to
process the novel query types efficiently, we introduce an index
structure called Gauss-tree. In addition, we specify further
probabilistic models and query types for uncertain multi-instance
objects and uncertain spatial objects. Based on the index
structure, we develop algorithms for an efficient processing of
these query types. Practical benefits of using probabilistic
feature vectors are demonstrated on a real-world application for
video similarity search. Furthermore, a similarity search technique
is presented that is based on aggregated multi-instance objects,
and that is suitable for video similarity search. This technique
takes multiple representations into account in order to achieve
better effectiveness. The second part of this thesis deals with two
major data mining techniques: clustering and classification. Since
privacy preservation is a very important demand of distributed
advanced applications, we propose using uncertainty for data
obfuscation in order to provide privacy preservation during
clustering. Furthermore, a model-based and a density-based
clustering method for multi-instance objects are developed.
Afterwards, original extensions and enhancements of the
density-based clustering algorithms DBSCAN and OPTICS for handling
multi-represented objects are introduced. Since several advanced
database systems like biological or multimedia database systems
handle predefined, very large class systems, two novel
classification techniques for large class sets that benefit from
using multiple representations are defined. The first
classification method is based on the idea of a k-nearest-neighbor
classifier. It employs a novel density-based technique to reduce
training instances and exploits the entropy impurity of the local
neighborhood in order to weight a given representation. The second
technique addresses hierarchically organized class systems. It uses
a novel hierarchical, supervised method for the reduction of large
multi-instance objects, e.g. audio or video, and applies support
vector machines for efficient hierarchical classification of
multi-represented objects. User benefits of this technique are
demonstrated by a prototype that performs a classification of large
music collections. The effectiveness and efficiency of all proposed
techniques are discussed and verified by comparison with
conventional approaches in versatile experimental evaluations on
real-world datasets.
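
To make the notion of identification over probabilistic feature vectors more concrete, the following is a minimal Python sketch. It is not the model from the thesis: it assumes that every feature is an independent Gaussian given by a (mean, standard deviation) pair, scores a query against each stored object by the density that both observations stem from the same true vector, and normalizes the scores over the database. The function names, the two query variants (a probability threshold and a top-k ranking), and the linear scan in place of an index such as the Gauss-tree are all illustrative assumptions.

```python
import math


def match_density(query, obj):
    """Density that two uncertain vectors describe the same true vector.

    Assumption: each vector is a list of (mean, std) pairs and the
    dimensions are independent Gaussians. The difference of two
    independent Gaussians is Gaussian with the summed variance.
    """
    density = 1.0
    for (mean_q, std_q), (mean_o, std_o) in zip(query, obj):
        var = std_q ** 2 + std_o ** 2
        diff = mean_q - mean_o
        density *= math.exp(-diff * diff / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return density


def identification_probabilities(query, database):
    """Normalize the match densities over all objects (uniform prior assumed)."""
    densities = {name: match_density(query, obj) for name, obj in database.items()}
    total = sum(densities.values())
    if total == 0.0:
        # All densities underflowed; no object can be distinguished.
        return {name: 0.0 for name in database}
    return {name: d / total for name, d in densities.items()}


def threshold_query(query, database, min_prob):
    """Hypothetical query type: names of objects with probability >= min_prob."""
    probs = identification_probabilities(query, database)
    return sorted(name for name, p in probs.items() if p >= min_prob)


def top_k_query(query, database, k):
    """Hypothetical query type: the k most probable object names, best first."""
    probs = identification_probabilities(query, database)
    return sorted(probs, key=probs.get, reverse=True)[:k]
```

This sketch scans every object for each query; an index structure would serve to avoid exactly that full scan on large databases.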