Description

16 years ago
Due to the increase in CPU power and ever-increasing data
storage capacity, more and more data of all kinds is recorded,
including temporal data. Time series, the most prevalent type of
temporal data, arise in a wide range of application domains.
Prominent examples include stock price data in economics, gene
expression data in biology, the course of environmental parameters
in meteorology, or data of moving objects recorded by traffic
sensors. This large amount of raw data can only be analyzed by
automated data mining algorithms in order to generate new
knowledge. One of the most basic data mining operations is the
similarity query, which computes a similarity or distance value for
two objects. Two aspects of such a similarity function are of
special interest: first, its semantics, and second, the
computational cost of calculating a similarity value. The
semantics is the actual notion of similarity and is highly
dependent on the analysis task at hand. This thesis addresses both
aspects. We introduce a number of new similarity measures for time
series data and show how they can efficiently be calculated by
means of index structures and query algorithms. The first of the
new similarity measures is threshold-based. Two time series are
considered similar if they exceed a user-given threshold during
similar time intervals. Aside from formally defining this
similarity measure, we show how to represent time series in such a
way that threshold-based queries can be efficiently calculated. Our
representation allows for the specification of the threshold value
at query time. This is useful, for example, for data mining tasks that
try to determine crucial thresholds. The next similarity measure
considers a relevant amplitude range. This range is scanned at a
given resolution, and features are extracted for each amplitude
value considered. We track how the feature values change across
the amplitude values and thus generate so-called feature
sequences. Different features can finally be combined to answer
amplitude-level-based similarity queries. In contrast to
traditional approaches, which aggregate global feature values along
the time dimension, we capture local characteristics and monitor
their change for different amplitude values. Furthermore, our
method enables the user to specify a relevant range of amplitude
values to be considered, so that the similarity notion can be adapted
to the current requirements. Next, we introduce so-called
interval-focused similarity queries. A user can specify one or
several time intervals that should be considered for the
calculation of the similarity value. Our main focus for this
similarity measure was the efficient support of the corresponding
query. In particular, we try to avoid loading complete time
series objects into main memory if only a relatively small portion
of a time series is of interest. We propose a time series
representation which can be used to calculate upper and lower
distance bounds, so that only a few time series objects have to be
completely loaded and refined. Again, the relevant time intervals
do not have to be known in advance. Finally, we define a similarity
measure for so-called uncertain time series, where several
amplitude values are given for each point in time. This can be due
to multiple recordings or to errors in measurements, so that no
exact value can be specified. We show how to efficiently support
queries on uncertain time series. The last part of this thesis
shows how data mining methods can be used to discover crucial
threshold parameters for the threshold-based similarity measure.
Furthermore, we present a data mining tool for time series.
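
As a rough illustration of the threshold-based notion described above, the following Python sketch extracts the time intervals in which each series exceeds a threshold and compares the two interval sets. It is only a sketch under stated assumptions: the function names, the interval-to-interval distance (Euclidean distance on start and end indices), and the symmetric averaged nearest-neighbor aggregation are chosen here for illustration and are not taken from the thesis, whose formal definition and index structures are not reproduced.

```python
from typing import List, Tuple

# Half-open index range [start, end) in which a series exceeds the threshold.
Interval = Tuple[int, int]


def intervals_above(series: List[float], threshold: float) -> List[Interval]:
    """Return the maximal index ranges [start, end) where series[i] > threshold."""
    intervals: List[Interval] = []
    start = None
    for i, value in enumerate(series):
        if value > threshold and start is None:
            start = i                      # an interval opens here
        elif value <= threshold and start is not None:
            intervals.append((start, i))   # the interval closes before index i
            start = None
    if start is not None:                  # series ends while above the threshold
        intervals.append((start, len(series)))
    return intervals


def interval_distance(a: Interval, b: Interval) -> float:
    """Assumed distance: Euclidean distance between (start, end) pairs."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def threshold_distance(x: List[float], y: List[float], threshold: float) -> float:
    """Illustrative distance between two series for a threshold given at query time."""
    ix = intervals_above(x, threshold)
    iy = intervals_above(y, threshold)
    if not ix and not iy:
        return 0.0                         # neither series exceeds the threshold
    if not ix or not iy:
        return float("inf")                # only one of them ever exceeds it

    def one_way(src: List[Interval], dst: List[Interval]) -> float:
        # Average distance from each interval in src to its nearest interval in dst.
        return sum(min(interval_distance(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (one_way(ix, iy) + one_way(iy, ix))


if __name__ == "__main__":
    x = [0, 2, 2, 0, 0, 3, 0]
    y = [0, 0, 2, 2, 0, 3, 0]
    print(intervals_above(x, 1.0))
    print(threshold_distance(x, y, 1.0))
```

Because the threshold is an ordinary argument, the same two series can be compared under different thresholds without any preprocessing, which mirrors the idea of specifying the threshold at query time. This naive version rescans both series on every call and does not reflect the efficient representation the thesis proposes.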
