kClust: fast and sensitive clustering of large protein sequence databases
Description
11 years ago
Background: Fueled by rapid progress in high-throughput sequencing,
the size of public sequence databases doubles every two years.
Searching the ever larger and more redundant databases is getting
increasingly inefficient. Clustering can help to organize sequences
into homologous and functionally similar groups and can improve the
speed, sensitivity, and readability of homology searches. However,
because the clustering time is quadratic in the number of
sequences, standard sequence search methods are becoming
impracticable. Results: Here we present a method to cluster large
protein sequence databases such as UniProt within days down to
20%-30% maximum pairwise sequence identity. kClust owes its speed
and sensitivity to an alignment-free prefilter that calculates the
cumulative score of all similar 6-mers between pairs of sequences,
and to a dynamic programming algorithm that operates on pairs of
similar 4-mers. To increase sensitivity further, kClust can run in
profile-sequence comparison mode, with profiles computed from the
clusters of a previous kClust iteration. kClust is two to three
orders of magnitude faster than clustering based on NCBI BLAST, and
on multidomain sequences of 20%-30% maximum pairwise sequence
identity it achieves comparable sensitivity and a lower false
discovery rate. It also compares favorably to CD-HIT and UCLUST in
terms of false discovery rate, sensitivity, and speed. Conclusions:
kClust fills the need for a fast, sensitive, and accurate tool to
cluster large protein sequence databases to below 30% sequence
identity. kClust is freely available under GPL at
ftp://toolkit.lmb.uni-muenchen.de/pub/kClust/.
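
To make the idea of an alignment-free prefilter concrete, the following is a minimal sketch and not kClust's actual implementation. It makes a loud simplification: it counts only exact shared 6-mers through an inverted index, whereas the abstract describes a cumulative score over similar 6-mers. The names `kmers`, `prefilter_pairs`, and `min_shared` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers contained in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter_pairs(seqs, k=6, min_shared=2):
    """Map (id_a, id_b) to the number of k-mers the two sequences share.

    Assumption: exact k-mer matches stand in for the similar-k-mer
    scoring described in the abstract. Only pairs sharing at least
    `min_shared` k-mers are returned; the rest are never aligned.
    """
    # Inverted index: k-mer -> ids of all sequences that contain it.
    index = defaultdict(set)
    for seq_id, seq in seqs.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)

    # Only sequences that co-occur under some k-mer are ever paired,
    # which avoids scoring every possible pair of sequences.
    scores = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            scores[(a, b)] += 1

    return {pair: s for pair, s in scores.items() if s >= min_shared}


if __name__ == "__main__":
    toy = {
        "a": "MKVLAAGIVGLK",
        "b": "MKVLAAGIVQQQ",
        "c": "WWWWWWPPPPPP",
    }
    print(prefilter_pairs(toy))
```

In this toy input, sequences `a` and `b` share their first four 6-mers and pass the filter, while `c` shares none and is never compared. Pairs that survive such a filter would then be handed to a more expensive alignment step.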