Journal article details
BMC Medical Informatics and Decision Making
Korean clinical entity recognition from diagnosis text using BERT
Tae-Hoon Lee1  Young-Min Kim2 
[1] Division of Interdisciplinary Industrial Studies, Hanyang University, 222 Wangsimni-ro, Seongdong-gu, Seoul, South Korea; Graduate School of Technology & Innovation Management, Hanyang University, 222 Wangsimni-ro, Seongdong-gu, Seoul, South Korea
Keywords: Clinical entity recognition; BERT; Korean; Diagnosis text
DOI: 10.1186/s12911-020-01241-8
Source: Springer
【 Abstract 】

Background: While clinical entity recognition mostly targets electronic health records (EHRs), there is also demand for handling other types of text data. Automatic medical diagnosis is one example of a new application that uses a different data source. In this work, we extract Korean clinical entities from a new medical dataset that is completely different from EHRs. The dataset is collected from an online QA site for medical diagnosis. Bidirectional Encoder Representations from Transformers (BERT), one of the best language representation models, is used to extract the entities.

Results: A slightly modified version of the BERT labeling strategy replaces the original labeling to enhance the separation of postpositions in Korean. A new clinical entity recognition dataset that we constructed, as well as a standard NER dataset, was used for the experiments. A pre-trained multilingual BERT model is used to initialize the entity recognition model. BERT significantly outperforms a character-level bidirectional LSTM-CRF, a benchmark model, on all metrics. The micro-averaged precision, recall, and F1 of BERT are 0.83, 0.85, and 0.84, whereas those of bi-LSTM-CRF are 0.82, 0.79, and 0.81, respectively. The recall of BERT is notably higher than that of the other model, which suggests that the trained BERT model detects out-of-vocabulary (OOV) words better than bi-LSTM-CRF.

Conclusions: The recently developed BERT and its WordPiece tokenization are effective for Korean clinical entity recognition. Experiments using a new dataset constructed for this purpose and a standard NER dataset show the superiority of BERT over a state-of-the-art method. To the best of our knowledge, this work is one of the first studies dealing with clinical entity extraction from non-EHR data.
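
The abstract reports micro-averaged precision, recall, and F1. As a reminder of what micro-averaging means, the minimal Python sketch below pools true positives, false positives, and false negatives over all entity types before computing the scores. The function name `micro_prf`, the entity type names, and the counts are hypothetical illustrations; they are not taken from the paper or its dataset.

```python
from typing import Dict, Tuple


def micro_prf(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1.

    `counts` maps an entity type to (true positives, false positives,
    false negatives). Counts are summed over all types first, so frequent
    entity types weigh more than rare ones.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical entity types and counts, for illustration only.
example_counts = {
    "SYMPTOM": (80, 20, 10),
    "BODY_PART": (40, 5, 15),
}
precision, recall, f1 = micro_prf(example_counts)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because the counts are pooled, a model that finds more entities of a frequent type raises micro-averaged recall more than the same gain on a rare type would.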

【 License 】

CC BY   
