Journal article details
GigaScience
The PFP and ESG protein function prediction methods in 2014: effect of database updates and ensemble approaches
Daisuke Kihara [2], Dukka B. KC [3], Samuel Chapman [3], Qing Wei [1], Ishita K. Khan [1]
[1] Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA
[2] Department of Biological Sciences, Purdue University, West Lafayette, IN 47907, USA
[3] Department of Computational Science and Engineering, North Carolina A & T State University, Greensboro, NC 27411, USA
Keywords: gene annotation; ensemble method; consensus method; ESG; PFP; function prediction; CAFA; sequence; protein function
Others: 1224895
DOI: 10.1186/s13742-015-0083-4
Received 2014-12-31, accepted 2015-08-27, published 2015
【 Abstract 】

Background

Functional annotation of novel proteins is one of the central problems in bioinformatics. As genome sequencing technologies advance, more and more sequence information is becoming available to analyze and annotate. To achieve fast, automatic function annotation, many computational (automated) function prediction (AFP) methods have been developed. To evaluate the performance of such methods objectively and on a large scale, community-wide assessment experiments have been conducted. The second round of the Critical Assessment of Function Annotation (CAFA) experiment was held in 2013–2014. The evaluation of participating groups was reported at a special interest group meeting of the Intelligent Systems for Molecular Biology (ISMB) conference in Boston in 2014. Our group participated in both CAFA1 and CAFA2 using multiple in-house AFP methods. Here, we report benchmark results for our methods, obtained while preparing for CAFA2 and before submitting function predictions for the CAFA2 targets.

Results

For CAFA2, we updated the annotation databases used by our methods, protein function prediction (PFP) and extended similarity group (ESG), and benchmarked their function prediction performance using the original (older) and updated databases. Performance evaluations for PFP with different settings and for ESG are discussed. We also developed two ensemble methods that combine function predictions from six independent, sequence-based AFP methods. We further analyzed the performance of our prediction methods when the predictions are enriched with the prior distribution of gene ontology (GO) terms. Examples of predictions by the ensemble methods are discussed.

Conclusions

Updating the annotation database was successful, improving the Fmax prediction accuracy score for both PFP and ESG. Adding the prior distribution of GO terms yielded little improvement. Both of the ensemble methods we developed improved the average Fmax score over all individual component methods except ESG. Our benchmark results will not only complement the overall assessment that will be done by the CAFA organizers, but also help elucidate the predictive power of sequence-based function prediction methods in general.
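
The abstract does not spell out how the Fmax score is computed. The sketch below assumes the protein-centric definition commonly used in the CAFA assessments: at each score threshold, precision is averaged over proteins that have at least one prediction at or above the threshold, recall is averaged over all benchmark proteins, and Fmax is the largest resulting F-measure. The function name `f_max`, the dictionary layout, and the 100-step threshold grid are illustrative assumptions, not the authors' code.

```python
from typing import Dict, Set


def f_max(predictions: Dict[str, Dict[str, float]],
          truth: Dict[str, Set[str]],
          steps: int = 100) -> float:
    """Protein-centric F-max over evenly spaced score thresholds.

    predictions: protein ID -> {GO term -> confidence score in [0, 1]}
    truth:       protein ID -> non-empty set of true GO terms (assumed)
    """
    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions = []      # one entry per protein with any prediction kept
        recall_sum = 0.0     # summed over every benchmark protein
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, score in scored.items()
                         if score >= threshold}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            f_measure = 2 * precision * recall / (precision + recall)
            best = max(best, f_measure)
    return best
```

In a real evaluation, the predicted and true term sets would typically be propagated to their ancestor terms in the GO hierarchy before scoring; that step is omitted here for brevity.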

【 License 】

   
2015 Khan et al.
