kClust: fast and sensitive clustering of large protein sequence databases ~ Medizin - Open Access LMU

Background: Fueled by rapid progress in high-throughput sequencing,
the size of public sequence databases doubles every two years.
Searching the ever larger and more redundant databases is getting
increasingly inefficient. Clustering can help to organize sequences
into homologous and functionally similar groups and can improve the
speed, sensitivity, and readability of homology searches. However,
because the clustering time is quadratic in the number of
sequences, standard sequence search methods are becoming
impracticable.

Results: Here we present a method to cluster large
protein sequence databases such as UniProt within days down to
20%-30% maximum pairwise sequence identity. kClust owes its speed
and sensitivity to an alignment-free prefilter that calculates the
cumulative score of all similar 6-mers between pairs of sequences,
and to a dynamic programming algorithm that operates on pairs of
similar 4-mers. To increase sensitivity further, kClust can run in
profile-sequence comparison mode, with profiles computed from the
clusters of a previous kClust iteration. kClust is two to three
orders of magnitude faster than clustering based on NCBI BLAST, and
on multidomain sequences of 20%-30% maximum pairwise sequence
identity it achieves comparable sensitivity and a lower false
discovery rate. It also compares favorably to CD-HIT and UCLUST in
terms of false discovery rate, sensitivity, and speed.

Conclusions:
kClust fills the need for a fast, sensitive, and accurate tool to
cluster large protein sequences databases to below 30% sequence
identity. kClust is freely available under GPL at
ftp://toolkit.lmb.uni-muenchen.de/pub/kClust/.
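
The idea behind an alignment-free k-mer prefilter can be illustrated with a minimal sketch. This is not kClust's implementation: kClust sums the scores of all similar 6-mers between two sequences, whereas the simplified version below only counts exactly matching k-mers. The function names (kmers, build_index, prefilter) and the min_shared threshold are hypothetical and chosen for illustration. The sketch shows how an inverted k-mer index proposes candidate pairs without comparing every sequence against every other one.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(seqs, k):
    """Map each k-mer to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for sid, seq in seqs.items():
        for km in kmers(seq, k):
            index[km].add(sid)
    return index


def prefilter(seqs, k=6, min_shared=2):
    """Return {(id_a, id_b): n} for pairs sharing >= min_shared distinct k-mers.

    Simplifying assumption: only exact k-mer matches are counted; a
    similarity-scored variant would also credit near-identical k-mers.
    """
    index = build_index(seqs, k)
    shared = defaultdict(int)
    # Only sequences that co-occur under some k-mer are ever paired.
    for ids in index.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


if __name__ == "__main__":
    toy = {
        "s1": "MKTAYIAKQRQISFVK",
        "s2": "GGMKTAYIAKQRLL",
        "s3": "WWWWWWWWWWPPPP",
    }
    print(prefilter(toy))
```

Only pairs that pass such a filter would be handed to a slower alignment step, which is where the time saving of a prefilter comes from. Counting exact matches is less sensitive than scoring similar k-mers, so this sketch would miss more distant homologs than the approach described above.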
