Journal article details
BMC Research Notes
Comparison of next-generation sequencing samples using compression-based distances and its application to phylogenetic reconstruction
Xin Chen [1], Ngoc Hieu Tran [1]
[1] School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
Keywords: Next-generation sequencing; Sequence compression; Sequence distance; Alignment-free sequence comparison
DOI: 10.1186/1756-0500-7-320
Received 2014-02-11; accepted 2014-05-16; published 2014
【 Abstract 】

Background

Enormous volumes of short read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly. Thus alignment-free methods are needed for the comparison of NGS samples of short reads.

Results

Recently several k-mer based distance measures such as CVTree, $d_2^S$, and co-phylog have been proposed or enhanced to address this problem. However, choosing an optimal k value for those distance measures is not trivial, since it may depend on different aspects of the sequence data. In this paper, we considered an alternative parameter-free approach: compression-based distance measures. These measures have shown good performance for the comparison of long genomic sequences, but they have not yet been tested on NGS short reads. Hence, we performed extensive validation in this study and showed that the compression-based distances are highly consistent with the distances obtained from the k-mer based methods, from the multiple sequence alignment approach, and from existing benchmarks in the literature. Moreover, because the compression-based distance measures are parameter-free, no parameter optimization is required, yet these measures still perform consistently well on multiple types of sequence data, across different kinds of species and taxonomy levels.
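
To make the idea of a compression-based distance concrete, the following is a minimal sketch of one widely used measure of this kind, the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the general-purpose `zlib` compressor is used only as a stand-in, and the helper names `compressed_size`, `ncd`, `random_dna`, and `mutate`, as well as the toy sequences, are hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the input after zlib compression (level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_dna(length: int, rng: random.Random) -> str:
    """Toy sequence: uniformly random bases (assumption, not real data)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Redraw each base at random with probability `rate`."""
    return "".join(
        rng.choice("ACGT") if rng.random() < rate else base for base in seq
    )


if __name__ == "__main__":
    rng = random.Random(0)
    genome = random_dna(4000, rng)
    relative = mutate(genome, 0.02, rng)   # closely related sequence
    unrelated = random_dna(4000, rng)      # independent sequence
    print("related:  ", ncd(genome.encode(), relative.encode()))
    print("unrelated:", ncd(genome.encode(), unrelated.encode()))
```

No parameter such as k appears anywhere in the computation; the only choice is the compressor. Note that `zlib` only finds repeats within a 32 KB window, so this stand-in is suitable for small toy inputs only; a sample of short reads could be passed in as the concatenation of its reads, but realistic data sizes would call for a compressor designed for long-range redundancy.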

Conclusions

The compression-based distance measures are assembly-free, alignment-free, and parameter-free, and thus represent useful tools for comparing long genomic sequences as well as NGS samples of short reads.

【 License 】

2014 Tran and Chen; licensee BioMed Central Ltd.
