Text Mining and Gene Expression Analysis Towards Combined Interpretation of High Throughput Data
Description
17 years ago
Microarrays can capture gene expression activity for thousands of
genes simultaneously and thus make it possible to analyze cell
physiology and disease processes at the molecular level. The
interpretation of microarray gene expression experiments benefits
from knowledge of the analyzed genes and proteins and the
biochemical networks in which they play a role. The trend is
towards the development of data analysis methods that integrate
diverse data types. Currently, the most comprehensive biomedical
knowledge source is a large repository of free text articles. Text
mining makes it possible to automatically extract and use
information from texts. This thesis addresses two key aspects,
biomedical text mining and gene expression data analysis, with the
focus on providing high-quality methods and data that contribute to
the development of integrated analysis approaches. The work is
structured in three parts. Each part begins by providing the
relevant background, and each chapter describes the developed
methods as well as applications and results. Part I deals with
biomedical text mining: Chapter 2 summarizes the relevant
background of text mining; it describes text mining fundamentals,
important text mining tasks, applications and particularities of
text mining in the biomedical domain, and evaluation issues. In
Chapter 3, a method for generating high-quality gene and protein
name dictionaries is described. The analysis of the generated
dictionaries revealed important properties of individual
nomenclatures and the databases used (Fundel and Zimmer, 2006). The
dictionaries are publicly available via a Wiki, a web service, and
several client applications (Szugat et al., 2005). In Chapter 4,
methods for the dictionary-based recognition of gene and protein
names in texts and their mapping onto unique database identifiers
are described. These methods make it possible to extract
information from texts and to integrate text-derived information
with data from other sources. Three named entity identification
systems have been set up, two of them building upon the previously
existing tool ProMiner (Hanisch et al., 2003). All of them have
shown very good performance in the BioCreAtIvE challenges (Fundel
et al., 2005a; Hanisch et al., 2005; Fundel and Zimmer, 2007). In
Chapter 5, a new method for relation extraction (Fundel et al.,
2007) is presented. It was applied to the largest collection of
biomedical literature abstracts, and thus a comprehensive network
of human gene and protein relations has been generated. A
classification approach (Küffner et al., 2006) can be used to
specify relation types further, e.g., as activating, direct
physical, or gene regulatory relations. Part II deals with gene
expression data analysis: Gene expression data needs to be
processed so that differentially expressed genes can be identified.
Gene expression data processing consists of several sequential
steps. Two important steps are normalization, which aims at
removing systematic variances between measurements, and
quantification of differential expression by p-value and fold
change determination. Numerous methods exist for these tasks.
Chapter 6 describes the relevant background of gene expression data
analysis; it presents the biological and technical principles of
microarrays and gives an overview of the most relevant data
processing steps. Finally, it provides a short introduction to
osteoarthritis, which is the focus of the analyzed gene
expression data sets. In Chapter 7, quality criteria for the
selection of normalization methods are described, and a method for
the identification of differentially expressed genes is proposed,
which is appropriate for data with large intensity variances
between spots representing the same gene (Fundel et al., 2005b).
Furthermore, a system is described that selects an appropriate
combination of feature selection method and classifier, and thus
identifies genes which lead to good classification results and show
consistent behavior in different sample subgroups (Davis et al.,
2006). The analysis of several gene expression data sets dealing
with osteoarthritis is described in Chapter 8. This chapter
contains the biomedical analysis of relevant disease processes and
distinct disease stages (Aigner et al., 2006a), and a comparison of
various microarray platforms and osteoarthritis models. Part III
deals with integrated approaches and thus provides the connection
between parts I and II: Chapter 9 gives an overview of different
types of integrated data analysis approaches, with a focus on
approaches that integrate gene expression data with manually
compiled data, large-scale networks, or text mining. In Chapter 10,
a method for the identification of genes which are consistently
regulated and have a coherent literature background (Küffner et
al., 2005) is described. This method indicates how gene and protein
name identification and gene expression data can be integrated to
return clusters which contain genes that are relevant for the
respective experiment together with literature information that
supports interpretation. Finally, Chapter 11 presents ideas on how
the described methods can contribute to current research, along
with possible future directions.
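
As an illustration of the two processing steps named in Part II (normalization, and quantification of differential expression by fold change and p-value), the following is a minimal Python sketch. It is not the method developed in the thesis: median centering and an exact permutation test are stand-ins chosen for simplicity, all function names are hypothetical, and the inputs are assumed to be log2 intensities.

```python
from itertools import combinations
from statistics import mean, median


def median_center(log_values):
    """Normalize one array by subtracting its median log2 intensity."""
    m = median(log_values)
    return [v - m for v in log_values]


def log2_fold_change(treated, control):
    """Difference of mean log2 intensities (treated minus control)."""
    return mean(treated) - mean(control)


def permutation_p_value(treated, control):
    """Exact two-sided permutation p-value for the difference of means.

    Enumerates every way of splitting the pooled values into groups of
    the original sizes, so it is only practical for small sample counts.
    """
    pooled = list(treated) + list(control)
    n, total = len(treated), sum(pooled)
    observed = abs(mean(treated) - mean(control))
    hits = count = 0
    for idx in combinations(range(len(pooled)), n):
        s = sum(pooled[i] for i in idx)
        diff = abs(s / n - (total - s) / (len(pooled) - n))
        # Small tolerance guards against floating-point rounding.
        hits += diff >= observed - 1e-12
        count += 1
    return hits / count


# Hypothetical, already normalized log2 values for a single gene.
treated = [3.0, 3.2, 3.1]
control = [1.0, 1.1, 0.9]
fold_change = log2_fold_change(treated, control)
p_value = permutation_p_value(treated, control)
```

With three samples per group there are only 20 possible splits, so the smallest attainable two-sided p-value is 0.1. This is one reason why fold change and p-value are usually considered together for small microarray studies.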