Description

15 years ago
In the 1990s a number of technological innovations appeared that
revolutionized biology, and 'Bioinformatics' became a new
scientific discipline. Microarrays can measure the abundance of
tens of thousands of mRNA species, data on the complete genomic
sequences of many different organisms are available, and other
technologies make it possible to study various processes at the
molecular level. In Bioinformatics and Biostatistics, current
research and computations are limited by the available computer
hardware. However, this limitation can be addressed with
high-performance computing resources. There are several reasons for
the increased focus on high-performance computing: larger data
sets, increased computational requirements stemming from more
sophisticated methodologies, and the latest developments in computer
chip production. The open-source programming language 'R' was
developed to provide a powerful and extensible environment for
statistical and graphical techniques. There are many good reasons
for preferring R to other software or programming languages for
scientific computations (in statistics and biology). However, the
R language was not developed with the aim of providing software
for parallel or high-performance computing. Nonetheless, during the
last decade, a great deal of research has been conducted on using
parallel computing techniques with R. This PhD thesis demonstrates
the usefulness of the R language and parallel computing for
biological research. It introduces parallel computing with R, and
reviews and evaluates existing techniques and R packages for
parallel computing on computer clusters, on multi-core systems, and
in grid computing. From a computer science point of view, the
packages were examined for their reusability in biological
applications, and some improvements were proposed. Furthermore,
parallel applications for next-generation sequence data and
preprocessing of microarray data were developed. Microarray data
are characterized by high levels of noise and bias. As these
perturbations have to be removed, preprocessing of raw data has
been a research topic of high priority over the past few years. A
new Bioconductor package called affyPara for parallelized
preprocessing of high-density oligonucleotide microarray data was
developed and published. The data can be partitioned at the level of
arrays using a block-cyclic scheme, which makes the
parallelization of algorithms directly possible. Existing
statistical algorithms and data structures had to be adjusted and
reformulated for use in parallel computing. Using the new
parallel infrastructure, normalization methods can be enhanced and
new methods become available. Partitioning the data and
distributing it to several nodes or processors solves the main memory
problem and accelerates the methods by up to a factor of fifteen for
300 arrays or more. The final part of the thesis presents a large
cancer study analysing more than 7000 microarrays from a publicly
available database, and estimating gene interaction networks. For
this purpose, a new R package for microarray data management was
developed, and various challenges regarding the analysis of this
amount of data are discussed. The comparison of gene networks for
different pathways and different cancer entities in this large
data set partly confirms already established forms of gene
interaction.
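
The block-cyclic partitioning of arrays mentioned in the abstract can be illustrated with a minimal sketch. This is not the affyPara implementation (which is written in R); it is a Python illustration under stated assumptions. The function name `block_cyclic_partition`, the `block_size` parameter, and the file names are hypothetical and serve only to show how consecutive blocks of arrays could be dealt out to workers in round-robin order.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def block_cyclic_partition(
    items: Sequence[T], n_workers: int, block_size: int = 1
) -> List[List[T]]:
    """Deal consecutive blocks of `items` out to workers in round-robin order.

    Hypothetical helper for illustration only; not taken from affyPara.
    """
    if n_workers < 1 or block_size < 1:
        raise ValueError("n_workers and block_size must be positive")
    parts: List[List[T]] = [[] for _ in range(n_workers)]
    for start in range(0, len(items), block_size):
        # Block k (0-based) is assigned to worker k mod n_workers.
        worker = (start // block_size) % n_workers
        parts[worker].extend(items[start:start + block_size])
    return parts


if __name__ == "__main__":
    # Illustrative file names; real studies would list their own array files.
    arrays = [f"array_{i:02d}.CEL" for i in range(10)]
    chunks = block_cyclic_partition(arrays, n_workers=3, block_size=2)
    for worker, chunk in enumerate(chunks):
        print(worker, chunk)
```

Because each worker holds only its own subset of arrays, no single node needs to load the full data set into main memory, which is the effect the abstract attributes to partitioning and distribution.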
