From Text to Knowledge
Description
14 years ago
The global information space provided by the World Wide Web has dramatically changed the way knowledge is shared around the world. To make this unbelievably huge information space accessible, search engines index the uploaded content and provide efficient algorithmic machinery for ranking the importance of documents with respect to an input query. All major search engines, such as Google, Yahoo, or Bing, are keyword-based, which is indisputably a very powerful approach for document-centered information needs. However, this unstructured, document-oriented paradigm of the World Wide Web has serious drawbacks when one searches for specific knowledge about real-world entities: when asked for advanced facts about entities, today's search engines are not very good at providing accurate answers. Hand-built knowledge bases such as Wikipedia, or its structured counterpart DBpedia, are excellent sources of common facts. However, these knowledge bases are far from complete, and most knowledge still lies buried in unstructured documents. Statistical machine learning methods have great potential to help bridge the gap between text and knowledge by (semi-)automatically transforming the unstructured representation of today's World Wide Web into a more structured one. This thesis is devoted to reducing this gap with Probabilistic Graphical Models, which play a crucial role in modern pattern recognition because they merge two important fields of applied mathematics: graph theory and probability theory.

The first part of the thesis presents a novel system called Text2SemRel that is able to (semi-)automatically construct knowledge bases from textual document collections. The resulting knowledge base consists of facts centered around entities and their relations. An essential part of the system is a novel algorithm for extracting relations between entity mentions that is based on Conditional Random Fields, which are undirected Probabilistic Graphical Models.

In the second part of the thesis, we use the power of directed Probabilistic Graphical Models to solve important knowledge discovery tasks in large, semantically annotated document collections. In particular, we present extensions of the Latent Dirichlet Allocation framework that learn, in an unsupervised way, the statistical semantic dependencies between unstructured representations such as documents and their semantic annotations. Semantic annotations of documents may refer to concepts originating from a thesaurus or ontology, but also to user-generated informal tags in social tagging systems. These forms of annotation represent a first step towards a more structured form of the World Wide Web.

In the last part of the thesis, we demonstrate the large-scale applicability of the proposed fact extraction system Text2SemRel. In particular, we extract semantic relations between genes and diseases from a large biomedical text repository. The resulting knowledge base contains far more potential disease genes than are currently stored in curated databases; the proposed system is thus able to unlock knowledge currently buried in the literature. The literature-derived human gene-disease network is then analyzed with respect to existing state-of-the-art curated databases: we compare the derived knowledge base quantitatively with several curated databases with regard to, among other things, database size and the properties of known disease genes. Our experimental analysis shows that the facts extracted from the literature are of high quality.
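The abstract mentions relation extraction based on Conditional Random Fields. As a minimal sketch of the core inference step in a linear-chain CRF, the following shows Viterbi decoding over toy, hand-set emission and transition scores; the label set (`O`/`REL`), the example sentence, and all scores are invented for illustration and are not taken from Text2SemRel.

```python
def viterbi(obs_scores, trans_scores, labels):
    """Most likely label sequence under a linear-chain CRF.

    obs_scores: one dict per token mapping label -> emission score.
    trans_scores: dict mapping (prev_label, label) -> transition score.
    """
    # best[t][lab]: score of the best path ending in `lab` at position t
    best = [{lab: obs_scores[0][lab] for lab in labels}]
    back = []  # backpointers for path recovery
    for t in range(1, len(obs_scores)):
        col, ptr = {}, {}
        for lab in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans_scores[(p, lab)])
            col[lab] = best[-1][prev] + trans_scores[(prev, lab)] + obs_scores[t][lab]
            ptr[lab] = prev
        best.append(col)
        back.append(ptr)
    # Trace the best final label backwards through the pointers
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy example: tag the relation cue in "BRCA1 causes cancer" (scores invented)
labels = ["O", "REL"]
obs = [{"O": 2.0, "REL": 0.1},   # "BRCA1"
       {"O": 0.2, "REL": 1.5},   # "causes"
       {"O": 1.8, "REL": 0.3}]   # "cancer"
trans = {(a, b): (0.5 if a == b else 0.0) for a in labels for b in labels}
print(viterbi(obs, trans, labels))  # -> ['O', 'REL', 'O']
```

In a real CRF the scores come from learned feature weights rather than hand-set numbers; the dynamic program over label sequences stays the same.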
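The second part of the thesis builds on Latent Dirichlet Allocation. As a rough, self-contained sketch of the base model only (plain LDA, not the annotated-document extensions the thesis proposes), here is a tiny collapsed Gibbs sampler; the toy corpus and all hyperparameter values are invented for illustration.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for plain LDA on a tiny corpus of word-id lists."""
    rng = random.Random(seed)
    nd = [[0] * n_topics for _ in docs]               # doc-topic counts
    nw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    nt = [0] * n_topics                               # tokens per topic
    z = []                                            # topic assignment per token
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)               # random initial topic
            zs.append(k)
            nd[d][k] += 1; nw[k][w] += 1; nt[k] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                           # remove current assignment
                nd[d][k] -= 1; nw[k][w] -= 1; nt[k] -= 1
                # full conditional P(z = t | rest), up to a constant
                weights = [(nd[d][t] + alpha) * (nw[t][w] + beta) / (nt[t] + vocab_size * beta)
                           for t in range(n_topics)]
                r = rng.random() * sum(weights)
                for k, wt in enumerate(weights):      # sample a new topic
                    r -= wt
                    if r <= 0:
                        break
                nd[d][k] += 1; nw[k][w] += 1; nt[k] += 1
                z[d][i] = k
    return nd, nw

# Toy corpus over a 6-word vocabulary: ids 0-2 vs. 3-5 form two themes (invented)
docs = [[0, 1, 2, 0, 1], [0, 2, 2, 1], [3, 4, 5, 3, 4], [4, 5, 3, 5]]
nd, nw = lda_gibbs(docs, n_topics=2, vocab_size=6)
print(nd)  # per-document topic counts
```

The thesis's extensions additionally couple the document's words with its semantic annotations; this sketch shows only the shared sampling machinery underneath.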