Journal Article Details
Frontiers in Digital Health, Volume 4
Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature
Nicholas A. R. McQuibban [1]; Tom Shorter [2]; Thomas Rowlands [2]; Casiana M. Popovici [3]; Yan Hu [3]; Tim Beck [4]; Zhuoyu Li [5]; Filip Makraduli [5]; Shujian Sun [5]; Joram M. Posma [5]; Cheng S. Yeung [5]
[1] Centre for Integrative Systems Biology and Bioinformatics (CISBIO), Department of Life Sciences, Imperial College London, London, United Kingdom;
[2] Department of Genetics and Genome Biology, University of Leicester, Leicester, United Kingdom;
[3] Department of Surgery and Cancer, Imperial College London, London, United Kingdom;
[4] Health Data Research UK (HDR UK), London, United Kingdom;
[5] Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, United Kingdom;
Keywords: natural language processing; text mining; biomedical literature; semantics; health data
DOI: 10.3389/fdgth.2022.788124
Source: DOAJ
【 Abstract 】

To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardized. The BioC format is a simple, community-driven data structure for sharing text and annotations; however, access to biomedical literature in BioC format is limited, and bioinformatics tools for converting online publication HTML formats to BioC are lacking. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardization and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC. To standardize the description of heterogeneous publication sections, the Information Artifact Ontology is used to annotate each section within the BioC output. Secondly, Auto-CORPus transforms publication tables to a JSON format for storing, exchanging and annotating table data between text analytics systems. The BioC specification does not include a data structure for representing publication table data, so we present a JSON format for sharing table content and metadata. Inline tables within full-text HTML files and linked tables within separate HTML files are processed and converted to the machine-interpretable table JSON format. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates each abbreviation to its full definition. This abbreviation collection supports text mining tasks such as named entity recognition by including abbreviations unique to individual publications that are not contained within standard bio-ontologies and dictionaries. The Auto-CORPus package is freely available with detailed instructions from GitHub at: https://github.com/omicsNLP/Auto-CORPus.
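
The abbreviation output described in the abstract can be illustrated with a minimal sketch. It is not the Auto-CORPus implementation. The abstract does not specify the extraction method or the JSON layout, so the function name `find_abbreviations`, the initial-letter heuristic, and the flat abbreviation-to-definition mapping are all assumptions made for illustration.

```python
import json
import re


def find_abbreviations(text):
    """Hypothetical helper (not part of Auto-CORPus): map each short form
    declared as 'Long Form (LF)' to its long form, using a simple
    initial-letter heuristic."""
    found = {}
    # Assumption: a short form is 2-10 uppercase letters inside parentheses.
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        short = match.group(1)
        # Collect the words that precede the opening parenthesis.
        words = re.findall(r"[A-Za-z][A-Za-z-]*", text[:match.start()])
        # Take as many preceding words as the short form has letters.
        candidate = words[-len(short):]
        initials = "".join(word[0].upper() for word in candidate)
        # Accept only if the word initials spell out the short form.
        if len(candidate) == len(short) and initials == short:
            found[short] = " ".join(candidate)
    return found


sample = (
    "To analyse large corpora using machine learning and other "
    "Natural Language Processing (NLP) algorithms, the corpora need "
    "to be standardized."
)
abbreviations = find_abbreviations(sample)
# Serialize the mapping as JSON (assumed flat layout).
print(json.dumps(abbreviations, indent=2))
```

Run on the first sentence of the abstract, the sketch pairs NLP with "Natural Language Processing". A heuristic this simple misses definitions whose initials do not spell the short form, as well as the reversed pattern used for "Auto-CORPus (Automated pipeline for ...)", which is one reason a dedicated tool is useful.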

【 License 】

Unknown   
