Generalised Interaction Mining: Probabilistic, Statistical and Vectorised Methods in High Dimensional or Uncertain Databases

Description

Knowledge Discovery in Databases (KDD) is the non-trivial process
of identifying valid, novel, useful and ultimately understandable
patterns in data. The core step of the KDD process is the
application of Data Mining (DM) algorithms to efficiently find
interesting patterns in large databases. This thesis concerns
itself with three inter-related themes: generalised interaction and
rule mining; the incorporation of statistics into novel data mining
approaches; and probabilistic frequent pattern mining in uncertain
databases.

An interaction describes an effect that variables have
-- or appear to have -- on each other. Interaction mining is the
process of mining structures on variables describing their
interaction patterns -- usually represented as sets, graphs or
rules. Interactions may be complex, may represent both positive and
negative relationships, and one interaction can influence other
interactions or variables in interesting ways.
Finding interactions is useful in domains ranging from social
network analysis, marketing, the sciences, e-commerce, to
statistics and finance. Many data mining tasks may be considered as
mining interactions, such as clustering; frequent itemset mining;
association rule mining; classification rules; graph mining; flock
mining; etc. Interaction mining problems can have very different
semantics, pattern definitions, interestingness measures and data
types. Solving a wide range of interaction mining problems at the
abstract level, and doing so efficiently -- ideally more
efficiently than with specialised approaches -- is a challenging
problem. This thesis introduces and solves the Generalised
Interaction Mining (GIM) and Generalised Rule Mining (GRM)
problems. GIM and GRM use an efficient and intuitive computational
model based purely on vector valued functions. The semantics of the
interactions, their interestingness measures and the type of data
considered are flexible components of the vectorised frameworks. By
separating the semantics of a problem from the algorithm used to
mine it, the frameworks allow both to vary independently of each
other. This makes it easier to develop new methods by focusing
purely on a problem's semantics and removing the burden of
designing an efficient algorithm. By encoding interactions as
vectors in the space (or a sub-space) of samples, the frameworks
provide an intuitive geometric interpretation that inspires novel
methods. By
operating in time linear in the number of interesting interactions
that need to be examined, the GIM and GRM algorithms are optimal.
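To make the vectorised computational model concrete, the following is a
minimal sketch of the core idea only (illustrative Python, not the
thesis's GIM/GRM framework; all names are invented for this example):
each item is a Boolean vector in the space of samples, an itemset's
vector is the element-wise AND of its items' vectors, and an
interestingness measure such as support is simply a function of that
vector.

```python
import numpy as np

# Toy transaction database: rows are samples (transactions),
# columns are items. A True entry means the item occurs in that sample.
db = np.array([
    [True,  True,  False],   # t1: {a, b}
    [True,  True,  True],    # t2: {a, b, c}
    [True,  False, True],    # t3: {a, c}
    [False, True,  True],    # t4: {b, c}
])
items = ["a", "b", "c"]

def itemset_vector(item_indices):
    """The itemset's vector: element-wise AND of its items' sample vectors."""
    vec = np.ones(db.shape[0], dtype=bool)
    for i in item_indices:
        vec &= db[:, i]
    return vec

def support(vec):
    """An interestingness measure defined purely on the vector."""
    return int(vec.sum())

vec_ab = itemset_vector([0, 1])   # vector for itemset {a, b}
print(vec_ab)                     # [ True  True False False]
print(support(vec_ab))            # 2
```

Because the pattern representation and the measure are both plain
vector-valued functions, a different measure can be swapped in without
touching the mining algorithm -- the separation of semantics and
algorithm described above.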
The use of GRM or GIM provides efficient solutions to a range of
problems in this thesis, including graph mining, counting based
methods, itemset mining, clique mining, a clustering problem,
complex pattern mining, negative pattern mining, solving an
optimisation problem, spatial data mining, probabilistic itemset
mining, probabilistic association rule mining, feature selection
and generation, classification and multiplication rule mining.

Data mining is a hypothesis-generating endeavour, examining large
databases for patterns suggesting novel and useful knowledge to the
user. Since the database is a sample, the patterns found should
describe hypotheses about the underlying process generating the
data. In searching for these patterns, a DM algorithm makes
additional hypotheses when it prunes the search space. Natural
questions to ask, then, are: "Does the algorithm find patterns that
are statistically significant?" and "Did the algorithm make
significant decisions during its search?". Such questions address
the quality of patterns found through data mining and the confidence
that a user can have in utilising them. Finally, statistics has a
range of useful tools and measures that are applicable in data
mining. In this context, this thesis incorporates statistical
techniques -- in particular, non-parametric significance tests and
correlation -- directly into novel data mining approaches. This
idea is applied to statistically significant and relatively
class-correlated rule-based classification of imbalanced data sets;
significant frequent itemset mining; mining complex correlation
structures between variables for feature selection; mining
correlated multiplication rules for interaction mining and feature
generation; and conjunctive correlation rules for classification.
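As one illustration of embedding a non-parametric significance test in
rule mining, the sketch below (a generic example rather than the
thesis's specific method; rule_significance is an invented name) scores
a classification rule "antecedent => class" with Fisher's exact test on
its 2x2 contingency table, so that only rules whose class correlation
is unlikely under the null hypothesis are retained.

```python
from scipy.stats import fisher_exact

def rule_significance(n_both, n_antecedent, n_class, n_total):
    """p-value of a rule 'antecedent => class' via Fisher's exact test.

    n_both       : samples covered by the antecedent AND in the class
    n_antecedent : samples covered by the antecedent
    n_class      : samples in the class
    n_total      : all samples
    """
    a = n_both                                   # antecedent & class
    b = n_antecedent - n_both                    # antecedent & not class
    c = n_class - n_both                         # not antecedent & class
    d = n_total - n_antecedent - n_class + n_both
    _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
    return p

# A rule covering 30 samples, 25 of them in a target class of 40,
# out of 100 samples in total: strongly class-correlated, small p-value.
print(rule_significance(25, 30, 40, 100))
```

Being exact and distribution-free, such a test remains meaningful for
small and imbalanced classes, the setting mentioned above.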
The application of GIM or GRM to these problems leads to efficient
and intuitive solutions.

Frequent itemset mining (FIM) is a
fundamental problem in data mining. While it is usually assumed
that the items occurring in a transaction are known for certain, in
many applications the data is inherently noisy or probabilistic:
noise added in privacy-preserving data mining applications,
aggregation or grouping of records leading to estimated purchase
probabilities, and databases capturing naturally uncertain
phenomena. The consideration of existential uncertainty
of item(sets) makes traditional techniques inapplicable. Prior to
the work in this thesis, itemsets were mined if their expected
support was high. This returns only an estimate, ignores the
probability distribution of support, provides no confidence in the
results, and can lead to scenarios where itemsets are labelled
frequent even if they are more likely to be infrequent. Clearly,
this is undesirable.

This thesis proposes and solves the
Probabilistic Frequent Itemset Mining (PFIM) problem, where
itemsets are considered interesting if the probability that they
are frequent is high. The problem is solved under the possible
worlds model and a proposed probabilistic framework for PFIM. Novel
and efficient methods are developed for computing an itemset's
exact support probability distribution and frequentness
probability, using the Poisson binomial recurrence, generating
functions, or a Normal approximation. Incremental methods are
proposed to answer queries such as finding the top-k probabilistic
frequent itemsets. A number of specialised PFIM algorithms are
developed, with each being more efficient than the last: ProApriori
is the first solution to PFIM and is based on candidate generation
and testing. ProFP-Growth is the first probabilistic FP-Growth type
algorithm and uses a proposed probabilistic frequent pattern tree
(Pro-FPTree) to avoid candidate generation. Finally, the
application of GIM leads to GIM-PFIM, the fastest known algorithm
for solving the PFIM problem. It achieves orders of magnitude
improvements in space and time usage, and leads to an intuitive
subspace and probability-vector based interpretation of PFIM.
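To illustrate the Poisson binomial recurrence mentioned above, the
sketch below (illustrative Python, not the thesis's implementation)
computes an itemset's exact support distribution: if transaction t
contains the itemset with probability p_t, then
P(support = k | t transactions) =
p_t * P(support = k-1 | t-1) + (1 - p_t) * P(support = k | t-1),
and the frequentness probability is the tail mass
P(support >= minsup).

```python
def support_pmf(probs):
    """Exact support distribution via the Poisson binomial recurrence.

    probs[t] is the probability that transaction t contains the itemset;
    pmf[k] ends up as P(support == k). O(n^2) time, O(n) space.
    """
    pmf = [1.0]                        # with no transactions, support is 0
    for p in probs:
        new = [0.0] * (len(pmf) + 1)
        for k, q in enumerate(pmf):
            new[k] += (1.0 - p) * q    # transaction absent: support stays k
            new[k + 1] += p * q        # transaction present: support is k+1
        pmf = new
    return pmf

def frequentness(probs, minsup):
    """P(support >= minsup): the probabilistic frequentness measure."""
    return sum(support_pmf(probs)[minsup:])

# Two itemsets with identical expected support 2.0 under minsup = 2:
# expected support cannot tell them apart; frequentness probability can.
certain = [1.0, 1.0]          # occurs in two transactions for certain
noisy = [0.4] * 5             # occurs in five transactions with prob 0.4
print(frequentness(certain, 2))   # 1.0   -- certainly frequent
print(frequentness(noisy, 2))     # ~0.663 -- infrequent 1 time in 3
```

The example also shows why expected support alone gives no confidence:
both itemsets have expected support 2.0, yet one is certainly frequent
while the other is infrequent in roughly a third of the possible worlds.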
