BMC Genomics | |
SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data | |
Andrew H Paterson1  Changsoo Kim2  Xiyin Wang3  Hui Guo2  Tae-Ho Lee2  | |
[1] Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA;Plant Genome Mapping Laboratory, University of Georgia, Athens, GA 30602, USA;Center for Genomics and Computational Biology, School of Life Sciences and School of Sciences, Hebei United University, Tangshan, Hebei 063009, China | |
关键词: Maximum likelihood; Linkage disequilibrium; Polymorphisms; | |
Others : 1217834 DOI : 10.1186/1471-2164-15-162 |
|
received in 2013-09-25, accepted in 2014-02-18, 发布年份 2014 | |
【 摘 要 】
Background
Phylogenetic trees are widely used for genetic and evolutionary studies in various organisms. Advanced sequencing technology has dramatically enriched data available for constructing phylogenetic trees based on single nucleotide polymorphisms (SNPs). However, massive SNP data makes it difficult to perform reliable analysis, and there has been no ready-to-use pipeline to generate phylogenetic trees from these data.
Results
We developed a new pipeline, SNPhylo, to construct phylogenetic trees based on large SNP datasets. The pipeline may enable users to construct a phylogenetic tree from three representative SNP data file formats. In addition, in order to increase reliability of a tree, the pipeline has steps such as removing low quality data and considering linkage disequilibrium. A maximum likelihood method for the inference of phylogeny is also adopted in generation of a tree in our pipeline.
Conclusions
Using SNPhylo, users can easily produce a reliable phylogenetic tree from a large SNP data file. Thus, this pipeline can help a researcher focus more on interpretation of the results of analysis of voluminous data sets, rather than manipulations necessary to accomplish the analysis.
【 授权许可】
2014 Lee et al.; licensee BioMed Central Ltd.
【 预 览 】
Files | Size | Format | View |
---|---|---|---|
20150708151330535.pdf | 1226KB | download | |
Figure 5. | 66KB | Image | download |
Figure 4. | 95KB | Image | download |
Figure 3. | 78KB | Image | download |
Figure 2. | 74KB | Image | download |
Figure 1. | 101KB | Image | download |
【 图 表 】
Figure 1.
Figure 2.
Figure 3.
Figure 4.
Figure 5.
【 参考文献 】
- [1]Initiative TAG: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, 408(6814):796-815.
- [2]International Rice Genome Sequencing Project: The map-based sequence of the rice genome. Nature 2005, 436(7052):793-800.
- [3]Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, Xu D, Hellsten U, May GD, Yu Y, Sakurai T, Umezawa T, Bhattacharyya MK, Sandhu D, Valliyodan B, Lindquist E, Peto M, Grant D, Shu S, Goodstein D, Barry K, Futrell-Griggs M, Abernathy B, Du J, Tian Z, Zhu L, et al.: Genome sequence of the palaeopolyploid soybean. Nature 2010, 463(7278):178-183.
- [4]Paterson AH, Wendel JF, Gundlach H, Guo H, Jenkins J, Jin D, Llewellyn D, Showmaker KC, Shu S, Udall J, Yoo MJ, Byers R, Chen W, Doron-Faigenboim A, Duke MV, Gong L, Grimwood J, Grover C, Grupp K, Hu G, Lee TH, Li J, Lin L, Liu T, Marler BS, Page JT, Roberts AW, Romanel E, Sanders WS, Szadkowski E, et al.: Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres. Nature 2012, 492(7429):423-427.
- [5]The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature 2012, 491:56-65.
- [6]Guo S, Zhang J, Sun H, Salse J, Lucas WJ, Zhang H, Zheng Y, Mao L, Ren Y, Wang Z, Min J, Guo X, Murat F, Ham BK, Zhang Z, Gao S, Huang M, Xu Y, Zhong S, Bombarely A, Mueller LA, Zhao H, He H, Zhang Y, Zhang Z, Huang S, Tan T, Pang E, Lin K, Hu Q, et al.: The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions. Nat Genet 2013, 45(1):51-58.
- [7]Lam HM, Xu X, Liu X, Chen W, Yang G, Wong FL, Li MW, He W, Qin N, Wang B, Li J, Jian M, Wang J, Shao G, Wang J, Sun SS, Zhang G: Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection. Nat Genet 2010, 42(12):1053-1059.
- [8]Xu X, Liu X, Ge S, Jensen JD, Hu F, Li X, Dong Y, Gutenkunst RN, Fang L, Huang L, Li J, He W, Zhang G, Zheng X, Zhang F, Li Y, Yu C, Kristiansen K, Zhang X, Wang J, Wright M, McCouch S, Nielsen R, Wang J, Wang W: Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat Biotechnol 2012, 30(1):105-111.
- [9]Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, Wang X, Ott F, Muller J, Alonso-Blanco C, Borgwardt K, Schmid KJ, Weigel D: Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet 2011, 43(10):956-963.
- [10]McNally KL, Childs KL, Bohnert R, Davidson RM, Zhao K, Ulat VJ, Zeller G, Clark RM, Hoen DR, Bureau TE, Stokowski R, Ballinger DG, Frazer KA, Cox DR, Padhukasahasram B, Bustamante CD, Weigel D, Mackill DJ, Bruskiewich RM, Ratsch G, Buell CR, Leung H, Leach JE: Genomewide SNP variation reveals relationships among landraces and modern varieties of rice. Proc Natl Acad Sci USA 2009, 106(30):12273-12278.
- [11]Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5(2):163-166.
- [12]Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol 2007, 24(8):1596-1599.
- [13]Leekitcharoenphon P, Kaas RS, Thomsen MC, Friis C, Rasmussen S, Aarestrup FM: snpTree - a web-server to identify and construct SNP trees from whole genome sequence data. BMC Genomics 2012, 13(Suppl 7):S6. BioMed Central Full Text
- [14]Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS: A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 2012, 28(24):3326-3328.
- [15]Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32(5):1792-1797.
- [16]Schliep KP: phangorn: phylogenetic analysis in R. Bioinformatics 2011, 27(4):592-593.
- [17]Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O: New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 2010, 59(3):307-321.
- [18]Paradis E, Claude J, Strimmer K: APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 2004, 20(2):289-290.
- [19]Junier T, Zdobnov EM: The Newick utilities: high-throughput phylogenetic tree processing in the UNIX shell. Bioinformatics 2010, 26(13):1669-1670.
- [20]Gan X, Stegle O, Behr J, Steffen JG, Drewe P, Hildebrand KL, Lyngsoe R, Schultheiss SJ, Osborne EJ, Sreedharan VT, Kahles A, Bohnert R, Jean G, Derwent P, Kersey P, Belfield EJ, Harberd NP, Kemen E, Toomajian C, Kover PX, Clark RM, Ratsch G, Mott R: Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature 2011, 477(7365):419-423.
- [21]Nordborg M, Hu TT, Ishino Y, Jhaveri J, Toomajian C, Zheng H, Bakker E, Calabrese P, Gladstone J, Goyal R, Jakobsson M, Kim S, Morozov Y, Padhukasahasram B, Plagnol V, Rosenberg NA, Shah C, Wall JD, Wang J, Zhao K, Kalbfleisch T, Schulz V, Kreitman M, Bergelson J: The pattern of polymorphism in Arabidopsis thaliana. PLoS Biol 2005, 3(7):e196.