Journal article details
BMC Bioinformatics
Genome sequence-based species delimitation with confidence intervals and improved distance functions
Jan P Meier-Kolthoff2  Alexander F Auch1  Hans-Peter Klenk2  Markus Göker2 
[1] Eberhard-Karls-Universität, Tübingen, Germany
[2] Leibniz Institute DSMZ – German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany
Keywords: Taxonomy; Species concept; Phylogeny; MUMmer; Genomics; GBDP; GGDC; GGD; DDH; BLAST; Bacteria; Archaea
Others  :  1087974
DOI  :  10.1186/1471-2105-14-60
Received 2012-11-26, accepted 2013-02-04, published 2013
【 Abstract 】

Background

For the last 25 years, species delimitation in prokaryotes (Archaea and Bacteria) has to a large extent been based on DNA-DNA hybridization (DDH), a tedious lab procedure designed in the early 1970s that served its purpose astonishingly well in the absence of deciphered genome sequences. With the rapid progress in genome sequencing, the time has come to use the now available and easy-to-generate genome sequences directly for the delimitation of species. GBDP (Genome Blast Distance Phylogeny) infers genome-to-genome distances between pairs of entirely or partially sequenced genomes; these distances are a digital, highly reliable estimator of the relatedness of genomes. Its application as an in-silico replacement for DDH was recently introduced. The main challenge in implementing such an application is to produce digital DDH values that mimic the wet-lab DDH values as closely as possible, to ensure consistency in the prokaryotic species concept.

Results

Correlation and regression analyses were used to determine the best-performing methods and the most influential parameters. GBDP was further enriched with a set of new features: confidence intervals for intergenomic distances, obtained either via resampling or via the statistical models for DDH prediction, and an additional family of distance functions. As in previous analyses, GBDP obtained the highest agreement with wet-lab DDH among all tested methods, and improved models led to a further increase in the accuracy of DDH prediction. Confidence intervals yielded stable results when inferred from the statistical models, whereas those obtained via resampling showed marked differences between the underlying distance functions.
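
To make the idea of resampling-based confidence intervals concrete, the following is a minimal sketch and not the GBDP implementation. It assumes that each pairwise genome comparison is summarized as a list of high-scoring segment pairs given as (identities, aligned length) tuples, that a toy distance of 1 minus the identity fraction stands in for the actual distance functions (which are not specified in this abstract), and that a percentile bootstrap is used as the resampling scheme. The names `distance` and `bootstrap_ci` and the example values are hypothetical.

```python
import random


def distance(hsps):
    """Toy distance (an assumption): 1 - total identities / total aligned length."""
    total_len = sum(length for _, length in hsps)
    total_ident = sum(ident for ident, _ in hsps)
    return 1.0 - total_ident / total_len


def bootstrap_ci(hsps, replicates=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the toy distance.

    Segment pairs are resampled with replacement, the distance is
    recomputed for every replicate, and the empirical alpha/2 and
    1 - alpha/2 quantiles are returned.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(hsps)
    stats = sorted(
        distance([hsps[rng.randrange(n)] for _ in range(n)])
        for _ in range(replicates)
    )
    lower = stats[int((alpha / 2) * replicates)]
    upper = stats[min(replicates - 1, int((1 - alpha / 2) * replicates))]
    return lower, upper


# Made-up (identities, aligned length) pairs for one genome comparison.
hsps = [(950, 1000), (430, 500), (1800, 2000), (700, 800), (290, 300)]
point = distance(hsps)
low, high = bootstrap_ci(hsps)
print(f"distance = {point:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

Because the interval depends on how the distance weights individual segment pairs, swapping in a different distance function can change its width noticeably, which is in line with the differences between distance functions reported above.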

Conclusions

Despite the high accuracy of GBDP-based DDH prediction, inferences from limited empirical data are always associated with a certain degree of uncertainty. It is thus crucial to enrich in-silico DDH replacements with confidence-interval estimation, enabling the user to statistically evaluate the outcomes. Such methodological advancements, easily accessible through the web service at http://ggdc.dsmz.de, are crucial steps towards a consistent and truly genome sequence-based classification of microorganisms.

【 License 】

   
2013 Meier-Kolthoff et al.; licensee BioMed Central Ltd.
