Beschreibung

vor 17 Jahren
As new graph structured data is constantly being generated,
learning and data mining on graphs have become a challenge in
application areas such as molecular biology, telecommunications,
chemoinformatics, and social network analysis. The central
algorithmic problem in these areas, measuring similarity of graphs,
has therefore received extensive attention in the recent past.
Unfortunately, existing approaches are slow, lacking in
expressivity, or hard to parameterize. Graph kernels have recently
been proposed as a theoretically sound and promising approach to
the problem of graph comparison. Their attractivity stems from the
fact that by defining a kernel on graphs, a whole family of data
mining and machine learning algorithms becomes applicable to
graphs. These kernels on graphs must respect both the information
represented by the topology and the node and edge labels of the
graphs, while being efficient to compute. Existing methods fall
woefully short; they miss out on important topological information,
are plagued by runtime issues, and do not scale to large graphs.
Hence the primary goal of this thesis is to make learning and data
mining with graph kernels feasible. In the first half of this
thesis, we review and analyze the shortcomings of state-of-the-art
graph kernels. We then propose solutions to overcome these
weaknesses. As highlights of our research, we - speed up the
classic random walk graph kernel from O(n^6) to O(n^3), where n is
the number of nodes in the larger graph, and by a factor of up to
1,000 in CPU runtime, by extending concepts from Linear Algebra to
Reproducing Kernel Hilbert Spaces, - define novel graph kernels
based on shortest paths that avoid tottering and outperform random
walk kernels in accuracy, - define novel graph kernels that
estimate the frequency of small subgraphs within a large graph and
that work on large graphs hitherto not handled by existing graph
kernels. In the second half of this thesis, we present algorithmic
solutions to two novel problems in graph mining. First, we define a
two-sample test on graphs. Given two sets of graphs, or a pair of
graphs, this test lets us decide whether these graphs are likely to
originate from the same underlying distribution. To solve this
so-called two-sample-problem, we define the first kernel-based
two-sample test. Combined with graph kernels, this results in the
first two-sample test on graphs described in the literature.
Second, we propose a principled approach to supervised feature
selection on graphs. As in feature selection on vectors, feature
selection on graphs aims at finding features that are correlated
with the class membership of a graph. Towards this goal, we first
define a family of supervised feature selection algorithms based on
kernels and the Hilbert-Schmidt Independence Criterion. We then
show how to extend this principle of feature selection to graphs,
and how to combine it with gSpan, the state-of-the-art method for
frequent subgraph mining. On several benchmark datasets, our novel
procedure manages to select a small subset of dozens of informative
features among thousands and millions of subgraphs detected by
gSpan. In classification experiments, the features selected by our
method outperform those chosen by other feature selectors in terms
of classification accuracy. Along the way, we also solve several
problems that can be deemed contributions in their own right: - We
define a unifying framework for describing both variants of random
walk graph kernels proposed in the literature. - We present the
first theoretical connection between graph kernels and molecular
descriptors from chemoinformatics. - We show how to determine
sample sizes for estimating the frequency of certain subgraphs
within a large graph with a given precision and confidence, which
promises to be a key to the solution of important problems in data
mining and bioinformatics. Three branches of computer science
immediately benefit from our findings: data mining, machine
learning, and bioinformatics. For data mining, our efficient graph
kernels allow us to bring to bear the large family of kernel
methods to mining problems on real-world graph data. For machine
learning, we open the door to extend strong theoretical results on
learning on graphs into useful practical applications. For
bioinformatics, we make a number of principled kernel methods and
efficient kernel functions available for biological network
comparison, and structural comparisons of proteins. Apart from
these three areas, other fields may also benefit from our findings,
as our algorithms are general in nature and not restricted to a
particular type of application.

Kommentare (0)

Lade Inhalte...

Abonnenten

15
15
:
: