Journal article details
BMC Bioinformatics
Using cited references to improve the retrieval of related biomedical documents
Francisco M Ortuño [1], Ignacio Rojas [1], Miguel A Andrade-Navarro [2], Jean-Fred Fontaine [2]
[1] Computer Architecture and Computer Technology Department, University of Granada, C/ Periodista Daniel Saucedo Aranda S/N, Granada, 18071, Spain
[2] Computational Biology and Data Mining, Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, Berlin, 13125, Germany
Keywords: Document classification; Query expansion; Biomedical literature; Full-text documents; Citations; Text categorization; Information retrieval
DOI: 10.1186/1471-2105-14-113
Received 2012-09-25, accepted 2013-03-18, published 2013
【 Abstract 】

Background

Scientists reading a biomedical abstract often want to search bibliographic databases for topic-related documents. Such a query is challenging because a single abstract carries little information, whereas classification-based retrieval algorithms are best trained on large sets of relevant documents. As a solution to this problem, we propose a query expansion method that extends the information related to a manuscript using its cited references.
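
The query expansion idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy abstracts, the identifiers (`Q`, `R1`, `B1`, `C1`, ...), the helper names, and the smoothed word-presence log-odds scorer are all hypothetical stand-ins for a real abstract classifier.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Return the set of lowercased alphabetic words (presence, not frequency)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def expand_query(query_id, abstracts, cited_refs):
    """Return the query abstract plus the abstracts of its cited references.

    References without an available abstract are skipped.
    """
    ids = [query_id] + [r for r in cited_refs.get(query_id, []) if r in abstracts]
    return [abstracts[i] for i in ids]


def train_log_odds(positive_texts, background_texts):
    """Smoothed per-word log-odds of occurring in positive vs. background texts."""
    pos = Counter(w for t in positive_texts for w in tokens(t))
    bg = Counter(w for t in background_texts for w in tokens(t))
    n_pos, n_bg = len(positive_texts), len(background_texts)
    weights = {}
    for w in set(pos) | set(bg):
        p = (pos[w] + 1) / (n_pos + 2)  # add-one smoothing (an assumption)
        q = (bg[w] + 1) / (n_bg + 2)
        weights[w] = math.log(p / q)
    return weights


def score(text, weights):
    """Sum the weights of the words present in the text; unseen words count as 0."""
    return sum(weights.get(w, 0.0) for w in tokens(text))


# Hypothetical toy corpus: Q is the query, R* its references, B* background,
# C* candidate documents to rank.
abstracts = {
    "Q": "Kinase signalling in tumour cells",
    "R1": "Tumour growth depends on kinase activity",
    "R2": "Kinase inhibitors reduce tumour proliferation",
    "B1": "Soil bacteria fix nitrogen in legumes",
    "B2": "Nitrogen cycling in forest soil",
    "C1": "Inhibitors of tumour proliferation in mice",
    "C2": "Legumes and soil nitrogen",
}
cited_refs = {"Q": ["R1", "R2", "R9"]}  # R9 has no abstract in this toy corpus

training = expand_query("Q", abstracts, cited_refs)
background = [abstracts["B1"], abstracts["B2"]]
weights = train_log_odds(training, background)
ranked = sorted(["C1", "C2"], key=lambda d: score(abstracts[d], weights), reverse=True)
print(ranked)
```

The expanded training set here holds three abstracts instead of one, which is the point of the method: the classifier sees vocabulary from the cited references that the query abstract alone does not contain.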

Results

Data on cited references and text sections in 249,108 full-text biomedical articles were extracted from the Open Access subset of the PubMed Central® database (PMC-OA). Of the five standard sections of a scientific article, the Introduction and Discussion sections contained most of the citations (mean = 10.2 and 9.9 citations, respectively). A large proportion of articles (98.4%) and their cited references (79.5%) were indexed in the PubMed® database.

Using the MedlineRanker abstract classification tool, cited references allowed accurate retrieval of the citing document in a test set of 10,000 documents and also of documents related to six biomedical topics defined by particular MeSH® terms from the entire PMC-OA (p-value<0.01).

Classification performance was sensitive to the topic and also to the text sections from which the references were selected. Classifiers trained on the baseline (i.e., only text from the query document and not from the references) were outperformed in almost all the cases. Best performance was often obtained when using all cited references, though using the references from Introduction and Discussion sections led to similarly good results. This query expansion method performed significantly better than pseudo relevance feedback in 4 out of 6 topics.
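
Selecting references by the section in which they are cited, as compared above, might look like the sketch below. The record layout `(citing_article, section, cited_reference)`, the identifiers, and the function name `refs_from_sections` are assumptions made for illustration; the source does not describe its data structures.

```python
# Hypothetical citation records: (citing_article, section, cited_reference)
citations = [
    ("A1", "Introduction", "R1"),
    ("A1", "Introduction", "R2"),
    ("A1", "Methods", "R3"),
    ("A1", "Discussion", "R2"),
    ("A1", "Discussion", "R4"),
]


def refs_from_sections(records, article, sections=None):
    """Return the references cited by `article`, optionally limited to `sections`.

    With sections=None every section is used. Duplicates are dropped while
    keeping first-seen order.
    """
    seen = {}
    for art, section, ref in records:
        if art == article and (sections is None or section in sections):
            seen.setdefault(ref, None)
    return list(seen)


all_refs = refs_from_sections(citations, "A1")
intro_disc = refs_from_sections(citations, "A1", {"Introduction", "Discussion"})
print(all_refs)
print(intro_disc)
```

Either list could then be used to build the expanded training set, so that the smaller Introduction and Discussion subset can be compared against the full reference list.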

Conclusions

The retrieval of documents related to a single document can be significantly improved by using the references cited by this document (p-value<0.01). Using references from the Introduction and Discussion sections performs almost as well as using all references, which might be useful for methods that require reduced datasets due to computational limitations. Cited references from particular sections might not be appropriate for all topics. Our method could be a better alternative to pseudo relevance feedback, though it is limited by full-text availability.

【 License 】

   
2013 Ortuño et al.; licensee BioMed Central Ltd.
