BMC Bioinformatics
A framework for feature extraction from hospital medical data with applications in risk prediction
Truyen Tran [3], Wei Luo [2], Dinh Phung [2], Sunil Gupta [2], Santu Rana [2], Richard Lee Kennedy [1], Ann Larkins [4], Svetha Venkatesh [2]
[1] School of Medicine, Deakin University, Geelong, VIC, Australia
[2] Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong 3220, VIC, Australia
[3] Department of Computing, Curtin University, Perth, WA, Australia
[4] Barwon Health, Geelong, VIC, Australia
Keywords: Hospital data; Risk prediction; Feature extraction
DOI: 10.1186/s12859-014-0425-8
Received 2014-06-06, accepted 2014-12-11, published 2014
【 Abstract 】
Background
Feature engineering is a time-consuming component of predictive modeling. We propose a versatile platform that automatically extracts features for risk prediction, based on a pre-defined and extensible entity schema. The extraction is independent of disease type and of the risk prediction task. We compare the auto-extracted features with baselines generated from the Elixhauser comorbidities.
Results
Hospital medical records were transformed into event sequences, to which filters were applied to extract feature sets capturing diversity in temporal scales and data types. The features were evaluated on a readmission prediction task and compared with baseline feature sets generated from the Elixhauser comorbidities. The prediction model was logistic regression with elastic net regularization. Prediction horizons of 1, 2, 3, 6 and 12 months were considered for four diverse diseases: diabetes, COPD, mental disorders and pneumonia, with derivation and validation cohorts defined on non-overlapping data-collection periods.
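As a rough illustration of turning an event sequence into features at several temporal scales, the sketch below counts events per look-back window before the prediction point. It is a minimal sketch, not the authors' implementation: plain counting windows stand in for the filters mentioned above, and the window boundaries, the `(date, type, code)` event layout and the function name `extract_features` are all assumptions made for this example.

```python
from collections import Counter
from datetime import date

# Hypothetical look-back windows in days [start, end), measured back
# from the prediction point. These boundaries are illustrative only.
WINDOWS = [(0, 30), (30, 90), (90, 180), (180, 365)]


def extract_features(events, prediction_date, windows=WINDOWS):
    """Count events of each (type, code) in each look-back window.

    events: iterable of (event_date, event_type, code) tuples.
    Returns a dict mapping "type:code@start-endd" to a count.
    """
    features = Counter()
    for event_date, event_type, code in events:
        age = (prediction_date - event_date).days
        if age < 0:
            continue  # ignore anything recorded after the prediction point
        for start, end in windows:
            if start <= age < end:
                features[f"{event_type}:{code}@{start}-{end}d"] += 1
    return dict(features)


# Toy event sequence for one hypothetical patient.
events = [
    (date(2014, 5, 20), "diagnosis", "E11"),
    (date(2014, 5, 25), "diagnosis", "E11"),
    (date(2014, 3, 1), "admission", "emergency"),
    (date(2013, 9, 1), "diagnosis", "J44"),
]
feats = extract_features(events, date(2014, 6, 1))
```

Because the same routine is applied to every event type and code, nothing in it depends on the disease or the prediction task; only the event list and the prediction date change from one cohort to another.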
For unplanned readmissions, the auto-extracted feature set, built from socio-demographic information and medical records, outperformed baselines derived from socio-demographic information and Elixhauser comorbidities across all 20 settings (5 prediction horizons over 4 diseases). In particular, for 30-day prediction the AUCs were: COPD, baseline 0.60 (95% CI: 0.57, 0.63) versus auto-extracted 0.67 (0.64, 0.70); diabetes, baseline 0.60 (0.58, 0.63) versus auto-extracted 0.67 (0.64, 0.69); mental disorders, baseline 0.57 (0.54, 0.60) versus auto-extracted 0.69 (0.64, 0.70); pneumonia, baseline 0.61 (0.59, 0.63) versus auto-extracted 0.70 (0.67, 0.72).
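For readers unfamiliar with how an AUC and its confidence interval can be obtained, the following is a minimal sketch. The abstract does not say how the intervals were computed; the percentile bootstrap used here is an assumption, and the labels and scores are toy values, not study data.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive case is scored above a
    negative case, with ties counted as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacked one of the classes; draw again
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data: 1 = readmitted within the horizon, 0 = not readmitted.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.6]
point = auc(labels, scores)
low, high = bootstrap_ci(labels, scores)
```

On this toy input 14 of the 16 positive-negative pairs are ranked correctly, so `point` is 0.875; with so few cases the bootstrap interval is wide, which is why cohort-sized validation sets are needed for intervals as narrow as those reported above.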
Conclusions
We demonstrated the advantages of automatically extracting standard features from complex medical records in a disease- and task-agnostic manner. Auto-extracted features have good predictive power over multiple time horizons. Such feature sets have the potential to form the foundation of complex automated analytic tasks.
【 License 】
2014 Tran et al.; licensee BioMed Central.