In the current era of electronic health records (EHR), the use of data to make informed clinical decisions is at an all-time high. Although the collection, upkeep and accessibility of EHR data continue to grow, statistical methodology focused on aiding real-time clinical decision making is lacking. Improved decision-making tools generally lead to improved patient outcomes and lower healthcare costs. In this dissertation, we propose three statistical learning methods to improve clinical decision making based on EHR data.

In the first chapter, we propose a new classifier, SVM-CART, that combines features of Support Vector Machines (SVM) and Classification and Regression Trees (CART) to produce a flexible classifier that outperforms either method in terms of prediction accuracy and ease of use. The method is especially powerful in situations where the disease-exposure mechanisms may differ across subgroups of the population. In simulations under settings with high levels of interaction, the SVM-CART classifier yielded significant improvements in prediction accuracy. We illustrate our method by diagnosing neuropathy using various components of the metabolic syndrome. In predicting neuropathy, SVM-CART outperformed CART in terms of prediction accuracy and provided improved interpretability compared to SVM.

In the second chapter, we develop regression tree and ensemble methods for multivariate outcomes. We propose two general approaches to develop multivariate regression trees: (1) maximizing within-node homogeneity, and (2) maximizing between-node separation. Within-node homogeneity is measured using the average Mahalanobis distance and the determinant of the covariance matrix. For between-node separation, we propose using the Mahalanobis and Euclidean distances. The proposed multivariate regression trees are illustrated using two clinical datasets, one on neuropathy and one on pediatric cardiac surgery. In high-variance scenarios, or when the dimension of the outcome was large, the Mahalanobis distance split trees had the best prediction performance. The determinant split trees generally had a simple structure, and the Euclidean distance metrics performed well in large-sample settings. In both applications, the resulting multivariate trees improve usability and validity compared to predictions made using multiple univariate regression trees.

In the third chapter, we develop a sequential method to make predictions using shallow (large-scale EHR) data in tandem with deep (health system specific) patient data. Specifically, we use machine learning based methods to first give a prediction based on a large-scale EHR and then, for a select group of patients, refine that prediction based on the deep EHR data. We develop a novel framework, efficient in both time and cost, for identifying patient subgroups that would benefit most from a second-stage prediction refinement. The final tandem prediction is obtained by combining predictions from the first-stage and second-stage classifiers. We apply our tandem approach to predict extubation failure for pediatric patients who have undergone a critical cardiac operation, using shallow data from a national registry and deep, continuously streamed data captured in the intensive care unit. Using these two EHR data sources in tandem increased our ability to identify extubation failures in terms of the area under the ROC curve (AUC: 0.639) compared to using the national registry (AUC: 0.607) or the physiologic ICU data (AUC: 0.634) alone. Additionally, restricting the second-stage prediction refinement to a specific patient subgroup improved prediction further (AUC: 0.682) compared to giving every patient a deep-data prediction.
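
To make the between-node separation idea from the second chapter more concrete, the following is a minimal Python sketch of choosing a split by maximizing the Mahalanobis distance between the mean outcome vectors of the two child nodes. It is not the dissertation's implementation: the function names (`mahalanobis_separation`, `best_split`), the use of a pooled within-node covariance, the restriction to a single continuous covariate, and the `min_node_size` parameter are all assumptions made for illustration.

```python
import numpy as np


def mahalanobis_separation(Y_left, Y_right):
    """Squared Mahalanobis distance between the child-node mean outcome vectors.

    Assumption: the distance is scaled by the pooled within-node covariance.
    Each node needs at least two observations.
    """
    Y_left = np.asarray(Y_left, dtype=float)
    Y_right = np.asarray(Y_right, dtype=float)
    if Y_left.ndim == 1:
        Y_left = Y_left[:, None]
    if Y_right.ndim == 1:
        Y_right = Y_right[:, None]
    n_l, n_r = len(Y_left), len(Y_right)
    diff = Y_left.mean(axis=0) - Y_right.mean(axis=0)
    S_l = np.atleast_2d(np.cov(Y_left, rowvar=False))
    S_r = np.atleast_2d(np.cov(Y_right, rowvar=False))
    pooled = ((n_l - 1) * S_l + (n_r - 1) * S_r) / (n_l + n_r - 2)
    # Pseudo-inverse guards against a singular pooled covariance.
    return float(diff @ np.linalg.pinv(pooled) @ diff)


def best_split(x, Y, min_node_size=5):
    """Search thresholds on one covariate x; return (threshold, score).

    The returned threshold maximizes the between-node separation.
    Returns (None, -inf) if no admissible split exists.
    """
    x = np.asarray(x, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]
    order = np.argsort(x)
    x_sorted, Y_sorted = x[order], Y[order]
    best_threshold, best_score = None, -np.inf
    for i in range(min_node_size, len(x_sorted) - min_node_size + 1):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # cannot split between tied covariate values
        score = mahalanobis_separation(Y_sorted[:i], Y_sorted[i:])
        if score > best_score:
            best_score = score
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2.0
    return best_threshold, best_score
```

Applied recursively to each resulting child node, a search of this kind would grow a multivariate regression tree; a within-node homogeneity criterion could be substituted by changing only the scoring function.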
Statistical Learning Methods for Electronic Health Record Data