Journal article details
Journal of Biomedical Semantics
De-identifying free text of Japanese electronic health records
Yoshinobu Kano [1], Kohei Kajiyama [1], Mizuki Morita [2], Hiromasa Horiguchi [3], Takashi Okumura [4]
[1] Faculty of Informatics, Shizuoka University, Johoku 3-5-1, Naka-ku, Hamamatsu, 432-8011, Shizuoka, Japan; [2] Graduate School of Interdisciplinary Science and Engineering in Health Systems, Okayama University, 2-5-1, Kita-ku, 700-8558, Okayama, Okayama, Japan; [3] National Hospital Organization Headquarters, 2-5-21 Higashigaoka, Meguro-ku, 152-8621, Tokyo, Japan; [4] National University Corporation Kitami Institute of Technology, 165, Koencho, 090-8507, Kitami, Hokkaido, Japan
Keywords: De-identification; Electronic health records; Japanese language
DOI: 10.1186/s13326-020-00227-9
Source: Springer
【 Abstract 】

Background: Electronic data sources are increasingly available in the healthcare domain. Electronic health records (EHRs), with their vast amounts of potentially available data, can greatly improve healthcare. Although EHR de-identification is necessary to protect personal information, automatic de-identification of Japanese-language EHRs has not been studied sufficiently. This study aimed to improve de-identification performance for Japanese EHRs using classic machine learning, deep learning, and rule-based methods, depending on the dataset.

Results: Using three datasets, we implemented de-identification systems for Japanese EHRs and compared the performance of rule-based, Conditional Random Fields (CRF), and Long Short-Term Memory (LSTM)-based methods. Gold-standard tags for de-identification were annotated manually for age, hospital, person, sex, and time. We used different combinations of our datasets to train and evaluate the three methods. Our best F1-scores were 84.23, 68.19, and 81.67 points, respectively, for the MedNLP dataset, a dummy EHR dataset of fictitious records written by a medical doctor, and a Pathology Report dataset. The LSTM-based method performed best on every dataset except MedNLP, where the rule-based method was best. Even there, the LSTM-based method scored 83.07 points, 1.16 points below the best score, which was obtained with the rule-based method. These results suggest that LSTM adapted well to the differing characteristics of our datasets. On the Pathology Report dataset, our LSTM-based method outperformed our CRF-based method by 7.41 F1 points. This report is the first study to apply this LSTM-based method to a de-identification task on Japanese EHRs.

Conclusions: Our LSTM-based machine learning method extracted the named entities to be de-identified with generally better performance than our rule-based methods. However, machine learning methods are inadequate for processing expressions with low occurrence. Our future work will examine combining LSTM and rule-based methods to achieve better performance. Our current level of performance is sufficiently higher than that of publicly available Japanese de-identification tools; therefore, our system will be applied to actual de-identification tasks in hospitals.
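
The F1-scores reported in the Results can be illustrated with a minimal sketch of entity-level scoring. This is not the authors' evaluation code: the abstract does not say how matches were counted, so exact-match scoring on (start, end, tag) triples is an assumption, and the function name `f1_points` and the sample spans are hypothetical. Only the tag names (age, hospital, person, sex, time) come from the abstract.

```python
from typing import Set, Tuple

# An entity is (start_offset, end_offset, tag). Offsets here are made up
# for illustration; only the tag names come from the abstract.
Entity = Tuple[int, int, str]


def f1_points(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Exact-match entity-level F1, scaled to 0-100 points (assumed scheme)."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical gold annotations and system output for one document.
gold = {(0, 3, "person"), (10, 14, "hospital"), (20, 23, "age"), (30, 34, "time")}
predicted = {(0, 3, "person"), (10, 14, "hospital"), (20, 23, "sex")}

score = f1_points(gold, predicted)
print(f"F1 = {score:.2f} points")
```

Under this scheme a span with the right offsets but the wrong tag counts as both a missed gold entity and a spurious prediction, which is why the third prediction lowers precision and recall together.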

【 License 】

CC BY   
