Journal article details
Journal of Biomedical Semantics
Ambiguity and variability of database and software names in bioinformatics
Goran Nenadic [1], Robert Stevens [3], David L. Robertson [4], Aleksandar Kovacevic [2], Geraint Duck [3]
[1] Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK
[2] Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia
[3] School of Computer Science, The University of Manchester, Oxford Road, Manchester M13 9PL, UK
[4] Computational and Evolutionary Biology, Faculty of Life Sciences, The University of Manchester, Oxford Road, Manchester M13 9PT, UK
Keywords: Text-mining; Resource extraction; Dictionary; CRF; Computational biology; Bioinformatics
DOI  :  10.1186/s13326-015-0026-0
Received 2013-07-08, accepted 2015-06-05, published 2015
【 Abstract 】

Background

There are numerous options available to achieve various tasks in bioinformatics, but until recently, there were no tools that could systematically identify mentions of databases and tools within the literature. In this paper we explore the variability and ambiguity of database and software name mentions and compare dictionary and machine learning approaches to their identification.

Results

Through the development and analysis of a corpus of 60 full-text documents manually annotated at the mention level, we report high variability and ambiguity in database and software mentions. On a test set of 25 full-text documents, a baseline dictionary look-up achieved an F-score of 46 %, highlighting not only variability and ambiguity but also the extensive number of new resources introduced. A machine learning approach achieved an F-score of 63 % (with precision of 74 %) and 70 % (with precision of 83 %) for strict and lenient matching respectively. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature.
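The difference between strict and lenient matching can be illustrated with a minimal sketch. It assumes mentions are represented as character-offset spans, that strict matching requires identical boundaries, and that lenient matching accepts any overlap; the abstract does not state the exact lenient criterion, so the overlap rule, the `score` function, and the toy spans below are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def score(gold: List[Span], predicted: List[Span], lenient: bool = False):
    """Return (precision, recall, F-score) for predicted mention spans.

    Strict: a prediction counts only if its boundaries equal a gold span.
    Lenient (assumed here): any overlap with a gold span counts.
    """
    match = overlaps if lenient else (lambda a, b: a == b)
    # Predictions that hit some gold mention (for precision).
    hit_pred = sum(1 for p in predicted if any(match(p, g) for g in gold))
    # Gold mentions that were found by some prediction (for recall).
    hit_gold = sum(1 for g in gold if any(match(g, p) for p in predicted))
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data (hypothetical): one exact hit, one partial hit, one miss.
gold = [(0, 5), (10, 18), (30, 36)]
predicted = [(0, 5), (12, 18), (40, 44)]

strict = score(gold, predicted, lenient=False)
relaxed = score(gold, predicted, lenient=True)
```

In this toy case the partially overlapping span `(12, 18)` is rejected under strict matching and accepted under lenient matching, which is why lenient scores are never lower than strict ones under these definitions.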

Conclusions

Our analyses show that identification of mentions of databases and tools is a challenging task that cannot be achieved by relying on current manually curated resource repositories. Although machine learning shows improvement and promise (primarily in precision), more contextual information needs to be taken into account to achieve a good degree of accuracy.

【 License 】

   
2015 Duck et al.
