Journal Article Details
BMC Bioinformatics
A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences
Methodology Article
David J Russell [1], Khalid Sayood [1], Samuel F Way [1], Andrew K Benson [2]
[1] Department of Electrical Engineering, University of Nebraska-Lincoln, 209N WSEC, 68588-0511, Lincoln, NE, USA; Department of Food Science and Technology, University of Nebraska-Lincoln, 143 Filley Hall, 68583-0919, Lincoln, NE, USA; Core for Applied Genomics and Ecology, University of Nebraska-Lincoln, 143 Filley Hall, 68583-0919, Lincoln, NE, USA
Keywords: Basis Sequence; Representative Sequence; Suffix Tree; FASTA File; Jaccard Coefficient
DOI  :  10.1186/1471-2105-11-601
Received 2010-05-27, accepted 2010-12-17, published 2010
Source: Springer
【 Abstract 】

Background: We propose a sequence clustering algorithm and compare its partition quality and execution time with those of a popular existing algorithm. The proposed algorithm uses a grammar-based distance metric to partition a set of biological sequences. Each new sequence is compared with the cluster-representative sequences to determine its membership. If the comparison fails to identify a suitable cluster, a new cluster is created.

Results: The performance of the proposed algorithm is validated by comparison with the popular DNA/RNA sequence clustering approach CD-HIT-EST and with the recently developed algorithm UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm has a CPU execution time comparable to that of CD-HIT-EST, which is much slower than UCLUST, and it generates clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets.

Conclusions: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.
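The representative-based clustering loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not define the grammar-based metric, so the distance here is a stand-in that parses each sequence into LZ78-style phrases and takes one minus the Jaccard coefficient of the two phrase sets. The function names (`lz78_phrases`, `phrase_distance`, `greedy_cluster`) and the `threshold` parameter are hypothetical, chosen for illustration only.

```python
def lz78_phrases(seq):
    """Parse seq into LZ78-style phrases and return the set of distinct phrases.

    Assumption: this simple dictionary is a stand-in for the paper's grammar.
    """
    phrases = set()
    current = ""
    for ch in seq:
        current += ch
        if current not in phrases:
            # First time this phrase is seen: record it and start a new one.
            phrases.add(current)
            current = ""
    if current:
        # Trailing partial phrase (may already be in the set).
        phrases.add(current)
    return phrases


def phrase_distance(a, b):
    """Stand-in distance: 1 - Jaccard coefficient of the two phrase sets."""
    pa, pb = lz78_phrases(a), lz78_phrases(b)
    union = pa | pb
    if not union:
        return 0.0  # two empty sequences are treated as identical
    return 1.0 - len(pa & pb) / len(union)


def greedy_cluster(sequences, threshold=0.5):
    """Assign each sequence to the closest representative within threshold.

    If no representative is close enough, the sequence founds a new cluster
    and becomes that cluster's representative.
    """
    representatives = []
    clusters = []
    for seq in sequences:
        best_idx, best_dist = None, threshold
        for i, rep in enumerate(representatives):
            d = phrase_distance(seq, rep)
            if d <= best_dist:
                best_idx, best_dist = i, d
        if best_idx is None:
            representatives.append(seq)
            clusters.append([seq])
        else:
            clusters[best_idx].append(seq)
    return clusters


if __name__ == "__main__":
    toy = ["AAAA", "AAAA", "CCCC"]
    for cluster in greedy_cluster(toy, threshold=0.5):
        print(cluster)
```

In this sketch the first member of each cluster serves as its representative, so the result depends on input order; how the published algorithm chooses representatives and thresholds is not stated in the abstract.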

【 License 】

CC BY   
© Russell et al; licensee BioMed Central Ltd. 2010. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
