| GigaScience | |
| A close look at protein function prediction evaluation protocols | |
| Asa Ben-Hur [2], Karin M Verspoor [1], Fahad Ullah [2], Christopher S Funk [3], Indika Kahanda [2] | |
| [1] Department of Computing and Information Systems, University of Melbourne, 3010 Parkville, Victoria, Australia; [2] Department of Computer Science, Colorado State University, Fort Collins 80523, CO, USA; [3] Computational Bioscience Program, University of Colorado School of Medicine, Aurora 80045, CO, USA | |
| Keywords: Support vector machines; Machine learning; Gene Ontology; Automated function prediction | |
| Others: 1224897; DOI: 10.1186/s13742-015-0082-5 | |
| Received 2015-01-01, accepted 2015-08-24, published 2015 | |
【 Abstract 】
Background
The recently held Critical Assessment of Function Annotation challenge (CAFA2) required its participants to submit predictions for a large number of target proteins, regardless of whether those proteins had previous annotations. This contrasts with the original CAFA challenge, in which participants were asked to submit predictions only for proteins with no existing annotations. The CAFA2 task is more realistic, in that it more closely mimics the accumulation of annotations over time. In this study we compare these tasks in terms of their difficulty, and determine whether cross-validation provides a good estimate of performance.
Results
The CAFA2 task is a combination of two subtasks: making predictions on annotated proteins and making predictions on previously unannotated proteins. In this study we analyze the performance of several function prediction methods in these two scenarios. Our results show that several methods (structured support vector machine, binary support vector machines and guilt-by-association methods) do not usually achieve the same level of accuracy on these two tasks as that achieved by cross-validation, and that predicting novel annotations for previously annotated proteins is a harder problem than predicting annotations for uncharacterized proteins. We also find that different methods have different performance characteristics in these tasks, and that cross-validation is not adequate for estimating performance or for ranking methods.
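To make the two subtasks concrete, the following is a minimal sketch of how benchmark proteins might be partitioned using two annotation snapshots taken at different times. It is an illustration only, not the authors' protocol: the function name `split_targets`, the snapshot dictionaries, and the protein IDs are hypothetical, and the sketch assumes annotations are given as plain sets of GO term IDs without propagating terms up the ontology hierarchy.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: protein ID -> set of GO term IDs.
Annotations = Dict[str, Set[str]]


def split_targets(before: Annotations, after: Annotations) -> Tuple[Annotations, Annotations]:
    """Partition proteins that gained annotations between two snapshots.

    Returns (previously_unannotated, previously_annotated); each maps a
    protein ID to the GO terms it gained after the earlier snapshot.
    """
    previously_unannotated: Annotations = {}
    previously_annotated: Annotations = {}
    for protein, terms_after in after.items():
        terms_before = before.get(protein, set())
        gained = terms_after - terms_before
        if not gained:
            continue  # nothing new to evaluate for this protein
        if terms_before:
            previously_annotated[protein] = gained
        else:
            previously_unannotated[protein] = gained
    return previously_unannotated, previously_annotated


# Toy snapshots (made-up protein IDs, for illustration only).
before = {"P1": {"GO:0003677"}, "P2": set(), "P3": {"GO:0005515"}}
after = {
    "P1": {"GO:0003677", "GO:0006355"},
    "P2": {"GO:0016301"},
    "P3": {"GO:0005515"},
    "P4": {"GO:0005634"},
}
unannotated, annotated = split_targets(before, after)
```

In this toy example, `P2` and `P4` fall into the previously unannotated subtask, `P1` falls into the previously annotated subtask with one novel term, and `P3` is excluded because it gained nothing. A real evaluation would also need to handle each GO sub-ontology separately and account for the ontology structure, which this sketch leaves out.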
Conclusions
These results have implications for the design of computational experiments in the area of automated function prediction and can provide useful insight for the understanding and design of future CAFA competitions.
【 License 】
2015 Kahanda et al.