BMC Research Notes
Comparison of next-generation sequencing samples using compression-based distances and its application to phylogenetic reconstruction
Xin Chen [1], Ngoc Hieu Tran [1]
[1] School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
Keywords: Next-generation sequencing; Sequence compression; Sequence distance; Alignment-free sequence comparison
DOI: 10.1186/1756-0500-7-320
Received 2014-02-11, accepted 2014-05-16, published 2014
【 Abstract 】
Background
Enormous volumes of short read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly. Thus alignment-free methods are needed for the comparison of NGS samples of short reads.
Results
Recently, several k-mer based distance measures such as CVTree, $d_2^S$, and co-phylog have been proposed or enhanced to address this problem. However, choosing an optimal k value for those distance measures is not trivial, since it may depend on different aspects of the sequence data. In this paper, we considered an alternative parameter-free approach: compression-based distance measures. These measures have shown good performance for the comparison of long genomic sequences, but they have not yet been tested on NGS short reads. Hence, we performed extensive validation in this study and showed that the compression-based distances are highly consistent with those distances obtained from the k-mer based methods, from the multiple sequence alignment approach, and from existing benchmarks in the literature. Moreover, because the compression-based distance measures are parameter-free, no parameter optimization is required, and these measures still perform consistently well on multiple types of sequence data, for different kinds of species and taxonomy levels.
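To make the idea of a compression-based distance concrete, the following is a minimal sketch of the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which compressor or which exact distance formula was used, so `zlib` is only a convenient stand-in, and the helper names (`compressed_size`, `ncd`, `random_dna`, `mutate`) are hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (a stand-in for C(.))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_dna(rng: random.Random, length: int) -> bytes:
    """Toy data: a uniformly random DNA string (assumption, not real reads)."""
    return "".join(rng.choice("ACGT") for _ in range(length)).encode()


def mutate(rng: random.Random, seq: bytes, rate: float) -> bytes:
    """Redraw each base at random with the given probability."""
    out = bytearray(seq)
    for i in range(len(out)):
        if rng.random() < rate:
            out[i] = rng.choice(b"ACGT")
    return bytes(out)


rng = random.Random(0)
original = random_dna(rng, 5000)
close_relative = mutate(rng, original, 0.02)
unrelated = random_dna(rng, 5000)

d_self = ncd(original, original)
d_close = ncd(original, close_relative)
d_far = ncd(original, unrelated)
print(d_self, d_close, d_far)
```

On this toy data the distance grows as the second sequence diverges from the first, and no k value has to be chosen. Note that `zlib` only finds repeats within a 32 KB window, so this sketch is suitable for short toy sequences only; whole genomes or full read sets would need a compressor designed for such data.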
Conclusions
The compression-based distance measures are assembly-free, alignment-free, parameter-free, and thus represent useful tools for the comparison of long genomic sequences as well as the comparison of NGS samples of short reads.
【 License 】
2014 Tran and Chen; licensee BioMed Central Ltd.