Structural Summaries as a Core Technology for Efficient XML Retrieval
Beschreibung
vor 17 Jahren
The Extensible Markup Language (XML) is extremely popular as a
generic markup language for text documents with an explicit
hierarchical structure. The different types of XML data found in
today’s document repositories, digital libraries, intranets and on
the web range from flat text with little meaningful structure to be
queried, over truly semistructured data with a rich and often
irregular structure, to rather rigidly structured documents with
little text that would also fit a relational database system
(RDBS). Not surprisingly, various ways of storing and retrieving
XML data have been investigated, including native XML systems,
relational engines based on RDBSs, and hybrid combinations thereof.
Over the years a number of native XML indexing techniques have
emerged, the most important ones being structure indices and
labelling schemes. Structure indices represent the document schema
(i.e., the hierarchy of nested tags that occur in the documents) in
a compact central data structure so that structural query
constraints (e.g., path or tree patterns) can be efficiently
matched without accessing the documents. Labelling schemes specify
ways to assign unique identifiers, or labels, to the document nodes
so that specific relations (e.g., parent/child) between individual
nodes can be inferred from their labels alone in a decentralized
manner, again without accessing the documents themselves. Since
both structure indices and labelling schemes provide compact
approximate views on the document structure, we collectively refer
to them as structural summaries. This work presents new structural
summaries that enable highly efficient and scalable XML retrieval
in native, relational and hybrid systems. The key contribution of
our approach is threefold. (1) We introduce BIRD, a very efficient
and expressive labelling scheme for XML, and the CADG, a combined
text and structure index, and combine them as two complementary
building blocks of the same XML retrieval system. (2) We propose a
purely relational variant of BIRD and the CADG, called RCADG, that
is extremely fast and scales up to large document collections. (3)
We present the RCADG Cache, a hybrid system that enhances the RCADG
with incremental query evaluation based on cached results of
earlier queries. The RCADG Cache exploits schema information in the
RCADG to detect cached query results that can supply some or all
matches to a new query with little or no computational and I/O
effort. A main-memory cache index ensures that reusable query
results are quickly retrieved even in a huge cache. Our work shows
that structural summaries significantly improve the efficiency and
scalability of XML retrieval systems in several ways. Former
relational approaches have largely ignored structural summaries.
The RCADG shows that these native indexing techniques are equally
effective for XML retrieval in RDBSs. BIRD, unlike some other
labelling schemes, achieves high retrieval performance with a
fairly modest storage overhead. To the best of our knowledge, the
RCADG Cache is the only approach to take advantage of structural
summaries for effectively detecting query containment or overlap.
Moreover, no other XML cache we know of exploits intermediate
results that are produced as a by-product during the evaluation
from scratch. These are valuable cache contents that increase the
effectiveness of the cache at no extra computational cost.
Extensive experiments quantify the practical benefit of all of the
proposed techniques, which amounts to a performance gain of several
orders of magnitude compared to various other approaches.
generic markup language for text documents with an explicit
hierarchical structure. The different types of XML data found in
today’s document repositories, digital libraries, intranets and on
the web range from flat text with little meaningful structure to be
queried, over truly semistructured data with a rich and often
irregular structure, to rather rigidly structured documents with
little text that would also fit a relational database system
(RDBS). Not surprisingly, various ways of storing and retrieving
XML data have been investigated, including native XML systems,
relational engines based on RDBSs, and hybrid combinations thereof.
Over the years a number of native XML indexing techniques have
emerged, the most important ones being structure indices and
labelling schemes. Structure indices represent the document schema
(i.e., the hierarchy of nested tags that occur in the documents) in
a compact central data structure so that structural query
constraints (e.g., path or tree patterns) can be efficiently
matched without accessing the documents. Labelling schemes specify
ways to assign unique identifiers, or labels, to the document nodes
so that specific relations (e.g., parent/child) between individual
nodes can be inferred from their labels alone in a decentralized
manner, again without accessing the documents themselves. Since
both structure indices and labelling schemes provide compact
approximate views on the document structure, we collectively refer
to them as structural summaries. This work presents new structural
summaries that enable highly efficient and scalable XML retrieval
in native, relational and hybrid systems. The key contribution of
our approach is threefold. (1) We introduce BIRD, a very efficient
and expressive labelling scheme for XML, and the CADG, a combined
text and structure index, and combine them as two complementary
building blocks of the same XML retrieval system. (2) We propose a
purely relational variant of BIRD and the CADG, called RCADG, that
is extremely fast and scales up to large document collections. (3)
We present the RCADG Cache, a hybrid system that enhances the RCADG
with incremental query evaluation based on cached results of
earlier queries. The RCADG Cache exploits schema information in the
RCADG to detect cached query results that can supply some or all
matches to a new query with little or no computational and I/O
effort. A main-memory cache index ensures that reusable query
results are quickly retrieved even in a huge cache. Our work shows
that structural summaries significantly improve the efficiency and
scalability of XML retrieval systems in several ways. Former
relational approaches have largely ignored structural summaries.
The RCADG shows that these native indexing techniques are equally
effective for XML retrieval in RDBSs. BIRD, unlike some other
labelling schemes, achieves high retrieval performance with a
fairly modest storage overhead. To the best of our knowledge, the
RCADG Cache is the only approach to take advantage of structural
summaries for effectively detecting query containment or overlap.
Moreover, no other XML cache we know of exploits intermediate
results that are produced as a by-product during the evaluation
from scratch. These are valuable cache contents that increase the
effectiveness of the cache at no extra computational cost.
Extensive experiments quantify the practical benefit of all of the
proposed techniques, which amounts to a performance gain of several
orders of magnitude compared to various other approaches.
Weitere Episoden
vor 11 Jahren
vor 11 Jahren
vor 11 Jahren
In Podcasts werben
Kommentare (0)