Journal article details
BioData Mining
Supervised DNA Barcodes species classification: analysis, comparisons and results
Emanuel Weitschek [2], Giulia Fiscon [1], Giovanni Felici [2]
[1] Department of Computer, Control, and Management Engineering, Sapienza University, Via Ariosto, 25, 00185 Rome, Italy
[2] Institute of Systems Analysis and Computer Science Antonio Ruberti, National Research Council, Viale Manzoni, 30, 00185 Rome, Italy
Keywords: Species identification; Supervised classification methods; DNA Barcoding
Others  :  795105
DOI  :  10.1186/1756-0381-7-4
Received 2013-11-18, accepted 2014-04-05, published 2014
【 Abstract 】

Background

Short DNA fragments (e.g., from mitochondrial, nuclear, and plastid sequences) have been defined as DNA Barcodes and can be used as markers for organisms of the main kingdoms of life. Species classification with DNA Barcode sequences has proven effective on different organisms. Specific gene regions have been identified as Barcodes: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem consists of assigning an unknown specimen to a known species by analyzing its Barcode. This task must be supported by reliable methods and algorithms.

Methods

This work shows the efficacy of supervised machine learning methods for classifying species from DNA Barcode sequences. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with those of ad-hoc and well-established DNA Barcode classification methods.
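To make the supervised setting concrete, the following is a minimal sketch of a Naïve Bayes classifier over aligned Barcode sequences, where each alignment position is treated as a nominal attribute and the species is the class label. It is not the Weka implementation used in the paper: the function names, the symbol set `ALPHABET`, the Laplace smoothing, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter

# Assumed symbol set; real Barcodes may also contain IUPAC ambiguity codes.
ALPHABET = "ACGT-"


def train_naive_bayes(sequences, labels):
    """Count, for each species, how often each symbol occurs at each position."""
    length = len(sequences[0])
    class_counts = Counter(labels)
    # position_counts[species][position][symbol] -> number of training sequences
    position_counts = {s: [Counter() for _ in range(length)] for s in class_counts}
    for seq, species in zip(sequences, labels):
        for pos, symbol in enumerate(seq):
            position_counts[species][pos][symbol] += 1
    return class_counts, position_counts


def classify(sequence, model):
    """Return the species with the highest log-posterior for an aligned sequence."""
    class_counts, position_counts = model
    total = sum(class_counts.values())
    best_species, best_score = None, -math.inf
    for species, n in class_counts.items():
        score = math.log(n / total)  # log prior of the species
        for pos, symbol in enumerate(sequence):
            count = position_counts[species][pos][symbol]
            # Laplace smoothing so unseen symbols do not zero out the product
            score += math.log((count + 1) / (n + len(ALPHABET)))
        if score > best_score:
            best_species, best_score = species, score
    return best_species


# Hypothetical toy training set: four aligned sequences from two species.
train_seqs = ["ACGTAC", "ACGTAT", "TCGAAC", "TCGAAT"]
train_labels = ["species_a", "species_a", "species_b", "species_b"]
model = train_naive_bayes(train_seqs, train_labels)
```

With this toy model, `classify("ACGTAA", model)` assigns the query to `species_a`, because positions 1 and 4 match the nucleotides seen only in that species. The sketch assumes that all sequences are aligned to the same length.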

Results

A software tool that converts DNA Barcode FASTA sequences to the Weka format is released, to accommodate different input formats and to allow the execution of the classification procedure. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform, on average, the other classifiers considered, although they do not provide a human-interpretable classification model. Rule-based methods have slightly lower classification performance, but deliver the species-specific positions and nucleotide assignments. On synthetic data the supervised machine learning methods achieve better classification performance than the traditional DNA Barcode classification methods. On empirical data their classification performance is comparable to that of the other methods.
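The FASTA-to-Weka conversion step can be illustrated with a short sketch that writes Weka's ARFF format, with one nominal attribute per alignment position plus a species class attribute. This is not the released converter. The header convention `>id|species`, the function names, and the attribute names `pos1`, `pos2`, ... are assumptions, and the sketch does not quote species names that contain spaces.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_arff(text, relation="barcodes"):
    """Convert aligned FASTA records to an ARFF string with nominal attributes."""
    records = []
    for header, seq in parse_fasta(text):
        # Assumption: the species name is the last '|'-separated header field.
        species = header.split("|")[-1]
        records.append((species, seq))
    if not records:
        raise ValueError("no FASTA records found")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("sequences must be aligned to equal length")

    # Nominal value sets are derived from the data itself.
    symbols = sorted({ch for _, seq in records for ch in seq})
    species_names = sorted({sp for sp, _ in records})

    lines = [f"@relation {relation}", ""]
    for pos in range(1, length + 1):
        lines.append(f"@attribute pos{pos} {{{','.join(symbols)}}}")
    lines.append(f"@attribute species {{{','.join(species_names)}}}")
    lines += ["", "@data"]
    for sp, seq in records:
        # One comma-separated row per specimen, class label last.
        lines.append(",".join(seq) + "," + sp)
    return "\n".join(lines) + "\n"


# Hypothetical two-record input.
sample = ">s1|species_a\nACGT\n>s2|species_b\nACGA\n"
arff = fasta_to_arff(sample)
```

Treating every position as a nominal attribute is what allows rule-based and tree-based classifiers to report species-specific positions and nucleotide assignments directly in terms of the alignment.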

Conclusions

The classification analysis shows that supervised machine learning methods are promising candidates for successfully handling the DNA Barcoding species classification problem, obtaining excellent performance. In conclusion, a powerful tool for species identification is now available to the DNA Barcoding community.

【 License 】

2014 Weitschek et al.; licensee BioMed Central Ltd.
