Context-based bioinformatics
Description

The goal of bioinformatics is to develop innovative and practical methods and algorithms for biological questions. In many cases, these questions are driven by new biotechnological techniques, especially by genome- and cell-wide high-throughput experimental studies. In principle, there are two approaches: 1. Reduction and abstraction of the question to a clearly defined optimization problem that can be solved with appropriate and efficient algorithms. 2. Development of context-based methods that incorporate as much contextual knowledge as possible into the algorithms and derive practical solutions for relevant biological questions from the high-throughput data. These methods can often be supported by appropriate software tools and visualizations, allowing for interactive evaluation of the results by experts.

Context-based methods are often much more complex and require more involved algorithmic techniques to obtain practically relevant and efficient solutions for real-world problems, as in many cases even the simplified abstraction of a problem results in NP-hard problem instances. To solve these complex problems, one often needs to employ efficient data structures and heuristic search methods that solve clearly defined sub-problems using efficient (polynomial) optimization techniques such as dynamic programming, greedy algorithms, or path- and tree-algorithms.
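
As a purely illustrative example of such a polynomial sub-problem solver, the sketch below computes a global sequence alignment score by dynamic programming in the style of Needleman-Wunsch; the scoring parameters are arbitrary assumptions, not values from the thesis.

```python
# Illustrative only: global alignment score via dynamic programming
# (Needleman-Wunsch style); scoring parameters are arbitrary.
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # dp[i][j] = best score for aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,  # (mis)match
                           dp[i - 1][j] + gap,    # gap in b
                           dp[i][j - 1] + gap)    # gap in a
    return dp[n][m]

print(global_alignment_score("GATTACA", "GCATGCA"))  # runs in O(n*m) time
```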

In this thesis, we present new methods and analyses that address open questions of bioinformatics in different contexts by incorporating the corresponding contextual knowledge. The two main contexts in this thesis are protein structure similarity (Part I) and the network-based interpretation of high-throughput data (Part II).

For the protein structure similarity context (Part I), we analyze the consistency of gold-standard structure classification systems and derive a consistent benchmark set usable for different applications. We introduce two methods (Vorolign, PPM) for the protein structure similarity recognition problem, based on different features of the structures. Derived from the idea and results of Vorolign, we introduce the concept of a contact neighborhood potential, aiming to improve the results of protein fold recognition and threading.
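
The abstract does not define the potential itself; as background, knowledge-based contact potentials are commonly built as log-odds scores of observed versus expected contact frequencies. The following is a minimal sketch of that generic idea only, not the contact neighborhood potential introduced above.

```python
import math
from collections import Counter

# Generic knowledge-based contact potential (textbook idea only):
# score each residue-type pair by the log-odds of its observed vs.
# expected contact frequency in a set of reference structures.
def contact_potential(contacts):
    # contacts: observed residue-type pairs in contact, e.g. [("A", "L"), ...]
    pair_counts = Counter(frozenset(p) for p in contacts)
    res_counts = Counter(r for p in contacts for r in p)
    total_pairs = sum(pair_counts.values())
    total_res = sum(res_counts.values())
    potential = {}
    for pair, observed in pair_counts.items():
        a, b = sorted(pair)[0], sorted(pair)[-1]  # handles homo-pairs like {"C"}
        p_a, p_b = res_counts[a] / total_res, res_counts[b] / total_res
        expected = (p_a * p_b if a == b else 2 * p_a * p_b) * total_pairs
        potential[(a, b)] = -math.log(observed / expected)  # lower = more favorable
    return potential
```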

For the re-scoring problem of predicted structure models, we introduce the method Vorescore, which clearly improves fold-recognition performance and enables the evaluation of the contact neighborhood potential for structure prediction methods in general. We introduce a contact-consistent Vorolign variant, ccVorolign, which further improves structure-based fold recognition performance and enables direct optimization of the neighborhood potential in the future. Due to the enforcement of contact consistency, the ccVorolign method has a much higher computational complexity than the polynomial Vorolign method - the cost of computing interpretable and consistent alignments.
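
To make the notion of contact consistency concrete: an alignment is contact-consistent to the degree that contacts in one structure are mapped onto contacts in the other. A toy check of this property is sketched below; it is our own illustration, not the ccVorolign algorithm, which optimizes alignments under such a constraint.

```python
# Toy illustration of contact consistency (not the ccVorolign
# algorithm): the fraction of contacts of structure A that a given
# residue alignment maps onto contacts of structure B.
def contact_consistency(contacts_a, contacts_b, alignment):
    # contacts_*: sets of frozensets of residue indices in contact
    # alignment: dict mapping residue indices of A to residue indices of B
    mapped = [frozenset({alignment[i], alignment[j]})
              for i, j in (tuple(c) for c in contacts_a)
              if i in alignment and j in alignment]
    return sum(c in contacts_b for c in mapped) / len(mapped) if mapped else 0.0

contacts_a = {frozenset({1, 5}), frozenset({2, 6})}
contacts_b = {frozenset({10, 14})}
print(contact_consistency(contacts_a, contacts_b, {1: 10, 5: 14, 2: 11, 6: 13}))  # 0.5
```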

Finally, we introduce a novel structural alignment method (PPM) that enables the explicit modeling and handling of phenotypic plasticity in protein structures. We employ PPM to analyze the effects of alternative splicing on protein structures. With the help of PPM, we test the hypothesis that splice isoforms of the same protein can lead to protein structures with different folds (fold transitions).

In Part II of the thesis, we present methods for generating and using context information for the interpretation of high-throughput experiments. To generate context information on molecular regulations, we introduce novel text-mining approaches that extract relations automatically from scientific publications. In addition to a fast NER (named entity recognition) method (syngrep), we also present a novel, fully ontology-based, context-sensitive method (SynTree) that allows for the context-specific disambiguation of ambiguous synonyms, resulting in much better identification performance.
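
As a hedged illustration of these two text-mining steps, a toy dictionary matcher with a simple context-overlap disambiguation rule is sketched below; the dictionary, the cue sets, and all identifiers are hypothetical, and this is not the syngrep or SynTree implementation.

```python
# Hypothetical toy example: dictionary-based NER plus context-based
# disambiguation of ambiguous synonyms (not syngrep/SynTree).
SYNONYMS = {
    "p53": ["gene:TP53"],
    "CAT": ["gene:CAT", "animal:cat"],  # catalase vs. the animal
}
CONTEXT_CUES = {
    "gene:CAT": {"enzyme", "catalase", "oxidative"},
    "animal:cat": {"pet", "feline"},
}

def tag(sentence):
    words = sentence.replace(".", "").split()
    context = {w.lower() for w in words}
    hits = []
    for w in words:
        candidates = SYNONYMS.get(w, [])
        if len(candidates) == 1:
            hits.append((w, candidates[0]))
        elif candidates:
            # pick the candidate whose context cues overlap the sentence most
            best = max(candidates, key=lambda c: len(CONTEXT_CUES.get(c, set()) & context))
            hits.append((w, best))
    return hits

print(tag("CAT is an enzyme protecting cells from oxidative stress"))
# -> [('CAT', 'gene:CAT')]
```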

This context information is important for the interpretation of high-throughput data, but it is often missing in current databases. Despite all improvements, the results of automated text-mining methods remain error-prone. The RelAnn application presented in this thesis helps to curate the automatically extracted regulations, enabling manual, ontology-based curation and annotation.

For the use of high-throughput data, one needs additional methods for data processing, for example methods that map the hundreds of millions of short DNA/RNA fragments (so-called reads) to a reference genome or transcriptome. Such data (RNA-seq reads) are the output of next-generation sequencing machines, which are becoming more and more efficient and affordable. Unlike current state-of-the-art methods, our novel read-mapping method ContextMap resolves the occurring ambiguities in the final step of the mapping process, thereby exploiting the knowledge of the complete set of possible ambiguous mappings. This approach allows for higher precision, even if more nucleotide errors are tolerated in the read mappings in the first step.
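
A deliberately simplified sketch of the underlying idea, namely that keeping all ambiguous candidate mappings until the end allows a globally informed choice: here each ambiguous read picks the candidate locus with the most support from uniquely mapped reads. This is a stand-in heuristic for illustration, not the actual ContextMap algorithm.

```python
from collections import Counter

# Stand-in heuristic (not the actual ContextMap algorithm): resolve
# each ambiguously mapped read to the candidate locus best supported
# by the uniquely mapped reads.
def resolve(candidates):
    # candidates: read id -> list of candidate loci from the first mapping step
    support = Counter(locs[0] for locs in candidates.values() if len(locs) == 1)
    return {read: max(locs, key=lambda l: (support[l], l))  # ties broken by name
            for read, locs in candidates.items()}

reads = {
    "r1": ["chr1:100"],
    "r2": ["chr1:100"],
    "r3": ["chr1:100", "chr7:900"],  # ambiguous
}
print(resolve(reads))  # r3 is assigned chr1:100, supported by r1 and r2
```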

The consistency of the context information on molecular regulations, stored in databases or extracted by text mining, with the measured data can be used to identify and score consistent regulations (GGEA). This method substantially extends the commonly used gene-set-based methods such as over-representation analysis (ORA) and gene set enrichment analysis (GSEA).
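
For reference, the core of a standard ORA is a simple hypergeometric test for the enrichment of a gene set among the differentially expressed genes; a minimal version follows (the example numbers are made up).

```python
from scipy.stats import hypergeom

# Standard ORA: hypergeometric test for enrichment of a gene set
# among differentially expressed (DE) genes.
def ora_pvalue(n_universe, n_set, n_de, n_overlap):
    # P(overlap >= n_overlap) when drawing n_de genes from a universe
    # of n_universe genes containing n_set gene-set members
    return hypergeom.sf(n_overlap - 1, n_universe, n_set, n_de)

# e.g. 20000 genes, a pathway of 100, 500 DE genes, 15 of them in the pathway
print(ora_pvalue(20000, 100, 500, 15))
```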

Finally, we introduce the novel method RelExplain, which uses the extracted contextual knowledge to generate network-based, testable hypotheses for the interpretation of high-throughput data.