Source Code for Biology and Medicine
Annokey: an annotation tool based on key term search of the NCBI Entrez Gene database
Bernard J Pope [2], Karin Verspoor [1], Sori Kang [2], Tú Nguyen-Dumont [3], Daniel J Park [3]
[1] Department of Computing and Information Systems, Doug McDonell Building, The University of Melbourne, Melbourne, Victoria 3010, Australia
[2] Victorian Life Sciences Computation Initiative, The University of Melbourne, 187 Grattan Street, Melbourne, Victoria 3010, Australia
[3] Genetic Epidemiology Laboratory, Department of Pathology, Medical Building, The University of Melbourne, Melbourne, Victoria 3010, Australia
Keywords: PubMed article summaries; NCBI gene database; Keyword search; Gene annotation
DOI: 10.1186/1751-0473-9-15
Received 2014-03-27, accepted 2014-06-05, published 2014
【 Abstract 】
Background
The NCBI Entrez Gene and PubMed databases contain a wealth of high-quality information about genes for many different organisms. The NCBI Entrez online web-search interface is convenient for manually searching a small number of genes but impractical for the large gene lists produced by typical genomics projects.
Results
We have developed Annokey, an efficient open-source tool implemented in Python, which annotates gene lists with the results of a keyword search of the NCBI Entrez Gene database and linked PubMed article information. The user steers the search by specifying a ranked list of keywords (including multi-word phrases and regular expressions) that are correlated with their topic of interest. The rank of each matched term is reported, allowing the user to prioritize genes for further investigation.
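The ranked key-term search described above can be illustrated with a minimal sketch. This is not Annokey's actual implementation or interface: the function names `match_terms` and `annotate`, the toy gene records, and the example terms are all hypothetical, and the sketch assumes each gene is represented by a single block of text.

```python
import re

def match_terms(text, ranked_terms):
    """Return (rank, pattern) pairs for each key term found in text.

    ranked_terms is ordered from most to least important, so rank 1 is best.
    Each term is treated as a regular expression (plain phrases work as-is).
    """
    hits = []
    for rank, pattern in enumerate(ranked_terms, start=1):
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append((rank, pattern))
    return hits

def annotate(gene_records, ranked_terms):
    """Annotate each gene with its best matching rank and all matched terms."""
    annotated = []
    for gene, text in gene_records.items():
        hits = match_terms(text, ranked_terms)
        best_rank = hits[0][0] if hits else None
        annotated.append((gene, best_rank, [pattern for _, pattern in hits]))
    # Genes with a match come first, ordered by best rank; unmatched genes last.
    annotated.sort(key=lambda row: (row[1] is None, row[1] or 0))
    return annotated

# Hypothetical key terms: a phrase and two regular expressions.
ranked_terms = ["DNA repair", r"mismatch\s+repair", r"helicase\w*"]

# Toy records invented for illustration; not real Entrez Gene content.
toy_records = {
    "GENE_A": "Toy summary: encodes a helicase-like protein.",
    "GENE_B": "Toy summary: involved in DNA repair and mismatch repair.",
    "GENE_C": "Toy summary: structural protein of unknown function.",
}

result = annotate(toy_records, ranked_terms)
for gene, best_rank, terms in result:
    print(gene, best_rank, terms)
```

Reporting the best rank alongside every matched term keeps the output sortable while still showing why a gene was flagged.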
We applied Annokey to the entire human Entrez Gene database using the key-term “DNA repair” and assessed its performance in identifying the 176 members of a published “gold standard” list of genes established to be involved in this pathway. For this test case we observed a sensitivity and specificity of 97% and 96%, respectively.
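For readers who want to reproduce this kind of assessment, the sketch below computes sensitivity and specificity from sets of gene identifiers, assuming the usual definitions (true positives over all gold-standard genes, and true negatives over all genes not in the gold standard). The function name and the toy sets are hypothetical and do not reproduce the counts behind the figures reported above.

```python
def sensitivity_specificity(predicted, gold, universe):
    """Compute (sensitivity, specificity) for a predicted gene set.

    predicted: genes flagged by the key-term search
    gold:      genes in the gold-standard list
    universe:  every gene that was searched
    """
    tp = len(predicted & gold)             # flagged and in the gold standard
    fn = len(gold - predicted)             # in the gold standard but missed
    fp = len(predicted - gold)             # flagged but not in the gold standard
    tn = len(universe - predicted - gold)  # correctly left unflagged
    return tp / (tp + fn), tn / (tn + fp)

# Toy data invented for illustration only.
universe = {f"g{i}" for i in range(10)}
gold = {"g0", "g1", "g2", "g3"}
predicted = {"g0", "g1", "g2", "g4"}

sens, spec = sensitivity_specificity(predicted, gold, universe)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```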
Conclusions
Annokey facilitates the identification of genes related to an area of interest, a task that can be onerous if performed manually on a large number of genes. Annokey provides a way to capitalize on the high-quality information in the Entrez Gene database, offering both scalability and compatibility with automated analysis pipelines and thus the potential to significantly enhance research productivity.
【 License 】
2014 Park et al.; licensee BioMed Central Ltd.
【 Figures 】
Figure 1.