Journal article details
Journal of Biomedical Semantics
Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct
Karin M Verspoor [2], Asa Ben-Hur [1], Indika Kahanda [1], Christopher S Funk [3]
[1] Department of Computer Science, Colorado State University, Fort Collins 80523, CO, USA
[2] Health and Biomedical Informatics Centre, University of Melbourne, Parkville 3010, Victoria, Australia
[3] Computational Bioscience Program, University of Colorado School of Medicine, Aurora 80045, CO, USA
Keywords: Biomedical concept recognition; Protein function prediction; Text mining
DOI: 10.1186/s13326-015-0006-4
Received 2014-11-03; accepted 2015-02-27; published 2015
【 Abstract 】

Most computational methods that predict protein function do not take advantage of the large amount of information contained in the biomedical literature. In this work we evaluate both ontology term co-mention and bag-of-words features mined from the biomedical literature and analyze their impact in the context of a structured output support vector machine model, GOstruct. We find that even simple literature-based features are useful for predicting human protein function (F-max: Molecular Function = 0.408, Biological Process = 0.461, Cellular Component = 0.608). One advantage of literature features is that they make automated predictions easy to verify. Manual inspection of misclassifications shows that some predictions counted as false positives could be biologically valid, based on support extracted from the literature. Additionally, we present a “medium-throughput” pipeline that was used to annotate a large subset of co-mentions; we suggest that this strategy could help to speed up the rate at which proteins are curated.
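
The F-max values quoted above summarize precision and recall across score thresholds. The following is a minimal sketch of how such a score can be computed, assuming the protein-centric definition commonly used in the CAFA evaluations: precision is averaged over proteins with at least one prediction at the threshold, recall is averaged over all annotated proteins, and the maximum F-measure over thresholds is reported. The function name `f_max`, the input layout, and the threshold grid are illustrative assumptions, not taken from the paper, and the sketch does not propagate terms up the Gene Ontology hierarchy.

```python
def f_max(predictions, truth, thresholds=None):
    """Protein-centric F-max (illustrative sketch, not the paper's code).

    predictions: {protein_id: {go_term: score in [0, 1]}}
    truth:       {protein_id: set of true go_terms}
    """
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 1.00
        thresholds = [i / 100 for i in range(1, 101)]

    # Only proteins with at least one true annotation are evaluated.
    evaluated = {p: terms for p, terms in truth.items() if terms}
    if not evaluated:
        return 0.0

    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in evaluated.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            true_positives = len(predicted & true_terms)
            if predicted:
                # Precision counts only proteins with a prediction at t.
                precisions.append(true_positives / len(predicted))
            recall_sum += true_positives / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(evaluated)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy data with hypothetical protein and term identifiers.
toy_predictions = {
    "P1": {"GO:A": 0.9, "GO:B": 0.4},
    "P2": {"GO:A": 0.6},
}
toy_truth = {"P1": {"GO:A"}, "P2": {"GO:B"}}
toy_score = f_max(toy_predictions, toy_truth)
```

On the toy data the best threshold lies just above 0.6, where only the correct prediction for `P1` remains, so precision is 1, recall is 0.5, and the score is 2/3.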

【 License 】

   
2015 Funk et al.; licensee BioMed Central.
