| Big Data and Cognitive Computing | |
| A Simple Free-Text-like Method for Extracting Semi-Structured Data from Electronic Health Records: Exemplified in Prediction of In-Hospital Mortality | |
| Eyal Klang [1]; Matthew A. Levin [2]; David L. Reich [2]; Brendan G. Carr [3]; Alexis Zebrowski [3]; Jolion Mcgreevy [3]; Benjamin S. Glicksberg [4]; Robert Freeman [5]; Shelly Soffer [6] | |
| [1] Chaim Sheba Medical Center, Department of Diagnostic Imaging, Affiliated to Tel-Aviv University, Tel Aviv-Yafo 52621, Israel; [2] Department of Anesthesiology, Perioperative and Pain Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; [3] Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; [4] Hasso Plattner Institute for Digital Health at Mount Sinai, New York, NY 10065, USA; [5] Institute for Healthcare Delivery Science, Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; [6] Internal Medicine B, Assuta Medical Center, Ben-Gurion University of the Negev, Be’er Sheva 7747629, Israel | |
| Keywords: electronic health records; machine learning; gradient boosting | |
| DOI: 10.3390/bdcc5030040 | |
| Source: DOAJ | |
【 Abstract 】
The Epic electronic health record (EHR) is a commonly used EHR in the United States. This EHR contains large semi-structured “flowsheet” fields. Flowsheet fields lack a well-defined data dictionary and are unique to each site. We evaluated a simple free-text-like method to extract these data. As a use case, we demonstrate this method in predicting mortality during emergency department (ED) triage. We retrieved demographic and clinical data for ED visits from the Epic EHR (1/2014–12/2018). Data included structured fields, semi-structured flowsheet records, and free-text notes. The study outcome was in-hospital death within 48 h. Most of the data were coded using a free-text-like Bag-of-Words (BoW) approach. Two machine-learning models were trained: gradient boosting and logistic regression. Term frequency-inverse document frequency (tf-idf) weighting was employed in the logistic regression model (LR-tf-idf). An ensemble of LR-tf-idf and gradient boosting was also evaluated. Models were trained on years 2014–2017 and tested on year 2018. Among 412,859 visits, the 48-h mortality rate was 0.2%. LR-tf-idf showed an AUC of 0.98 (95% CI: 0.98–0.99). Gradient boosting showed an AUC of 0.97 (95% CI: 0.96–0.99). An ensemble of both showed an AUC of 0.99 (95% CI: 0.98–0.99). In conclusion, a free-text-like approach can be useful for extracting knowledge from large amounts of complex semi-structured EHR data.
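To make the free-text-like idea concrete, the following minimal sketch shows one way flowsheet rows could be turned into Bag-of-Words tokens and then tf-idf weights. It is an illustration under stated assumptions, not the authors' implementation: the flowsheet field names and values, the `field=value` token scheme, and the helper names `to_token`, `bag_of_words`, and `tf_idf` are all hypothetical, and the unsmoothed tf-idf formula used here may differ from the variant applied in the study.

```python
import math
from collections import Counter

# Hypothetical flowsheet rows for three ED visits: (field name, recorded value).
# Real flowsheet fields are site-specific; these names are illustrative only.
visits = [
    [("Pain Score", "8"), ("Mental Status", "Alert"), ("Arrival Mode", "Ambulance")],
    [("Pain Score", "2"), ("Mental Status", "Alert"), ("Arrival Mode", "Walk-in")],
    [("Mental Status", "Unresponsive"), ("Arrival Mode", "Ambulance")],
]


def to_token(field, value):
    """Fuse a field name and its value into a single word-like token (assumed scheme)."""
    text = f"{field}={value}".lower()
    return "_".join(text.split())


def bag_of_words(rows):
    """Count tokens for one visit, ignoring order, as with words in free text."""
    return Counter(to_token(field, value) for field, value in rows)


def tf_idf(bags):
    """Weight each token by term frequency times log(N / document frequency)."""
    n_docs = len(bags)
    # Iterating a Counter yields each distinct token once per visit.
    doc_freq = Counter(token for bag in bags for token in bag)
    weighted = []
    for bag in bags:
        total = sum(bag.values())
        weighted.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in bag.items()
        })
    return weighted


bags = [bag_of_words(rows) for rows in visits]
weights = tf_idf(bags)

for visit_weights in weights:
    print(sorted(visit_weights.items(), key=lambda item: -item[1]))
```

Because no data dictionary is needed to build the tokens, the same procedure applies to any site's flowsheet fields. In this toy data, a token that appears in only one visit receives a higher weight than a token shared by several visits; the resulting sparse vectors are the kind of input a logistic regression or gradient boosting classifier could consume.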
【 License 】
Unknown