BMC Bioinformatics
An investigation into inter- and intragenomic variations of graphic genomic signatures
Rallis Karamichalis [1], Lila Kari [1], Stavros Konstantinidis [2], Steffen Kopecki [2]
[1] Department of Computer Science, University of Western Ontario, London, ON, Canada
[2] Department of Mathematics and Computing Science, Saint Mary’s University, Halifax, NS, Canada
Keywords: Species classification; Genomic signature; Comparative genomics
Others: 1230255; DOI: 10.1186/s12859-015-0655-4
Received 2014-12-19, accepted 2015-06-30, published 2015
【 Abstract 】
Background
Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences.
Results
We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria - full genome), and P. furiosus (Archaea - full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method’s ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships.
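The pipeline described above (build a CGR for each sequence, then compare CGRs pairwise) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `cgr_counts`, `euclidean`, and `distance_matrix` are hypothetical, the corner assignment (A, C, G, T at the bottom-left, top-left, top-right, and bottom-right corners) is a common CGR convention assumed here, and the resolution `k=3` is chosen only to keep the example small. Only the Euclidean distance is shown; the abstract does not name the other five distances. The cell of each CGR point is tracked with integer bit shifts, which is equivalent to repeatedly taking the midpoint towards the current nucleotide's corner and reading off the grid cell.

```python
import math

# Assumed corner convention: (x bit, y bit) for each nucleotide.
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_counts(seq, k=3):
    """Count CGR points of `seq` on a 2^k x 2^k grid (one point per k-mer)."""
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    col = row = 0
    run = 0  # consecutive valid nucleotides seen since the last restart
    for base in seq.upper():
        bits = CORNER_BITS.get(base)
        if bits is None:
            # Ambiguous symbol (e.g. N): restart so no k-mer spans it.
            col = row = run = 0
            continue
        # Midpoint step: the new top bit is the corner, older bits shift down.
        col = (col >> 1) | (bits[0] << (k - 1))
        row = (row >> 1) | (bits[1] << (k - 1))
        run += 1
        if run >= k:
            grid[row][col] += 1
    return grid


def euclidean(a, b):
    """Euclidean distance between two grids of equal size."""
    return math.sqrt(
        sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    )


def distance_matrix(seqs, k=3):
    """Pairwise Euclidean distances between the CGR grids of `seqs`."""
    grids = [cgr_counts(s, k) for s in seqs]
    return [[euclidean(g, h) for h in grids] for g in grids]


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "AAAAAAAAAA", "ACGTACGTAA"]
    for line in distance_matrix(toy, k=2):
        print(["%.3f" % d for d in line])
```

A distance matrix of this kind is what a multidimensional scaling step would take as input to place the sequences as points in two or three dimensions; that step is omitted here.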
Conclusion
Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.
【 License 】
2015 Karamichalis et al.; licensee BioMed Central.
Files | Size | Format | View |
---|---|---|---|
Fig. 8. | 105KB | Image | download |
Fig. 7. | 114KB | Image | download |
Fig. 6. | 101KB | Image | download |
Fig. 5. | 101KB | Image | download |
Fig. 4. | 63KB | Image | download |
Fig. 3. | 70KB | Image | download |
Fig. 2. | 96KB | Image | download |
Fig. 1. | 106KB | Image | download |
【 Figures 】
Fig. 1.
Fig. 2.
Fig. 3.
Fig. 4.
Fig. 5.
Fig. 6.
Fig. 7.
Fig. 8.