Journal Article Details
BMC Medical Informatics and Decision Making
Predicting disease risks from highly imbalanced data using random forest
Research Article
Mohammed Khalilia [1], Mihail Popescu [2], Sounak Chakraborty [3]
[1] Department of Computer Science, University of Missouri, Columbia, Missouri, USA
[2] Department of Health Management and Informatics, University of Missouri, Columbia, Missouri, USA
[3] Department of Statistics, University of Missouri, Columbia, Missouri, USA
Keywords: Support Vector Machine; Random Forest; Imbalanced Data; Disease Prediction; National Inpatient Sample
DOI: 10.1186/1472-6947-11-51
Received 2010-09-28, accepted 2011-07-29, published 2011
Source: Springer
【 Abstract 】

Background: We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict an individual's disease risk from their medical diagnosis history. The methodology may be incorporated into a variety of applications, such as risk management, tailored health communication, and decision support systems in healthcare.

Methods: We used the National Inpatient Sample (NIS) data, which is publicly available through HCUP, to train random forest (RF) classifiers for disease prediction. Because the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting, and RF in predicting the risk of eight chronic diseases.

Results: We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging, and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.

Conclusions: By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.
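The repeated random sub-sampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact partitioning scheme, so the details here are assumptions: every sub-sample reuses all minority-class rows and pairs them with a disjoint, equally sized random chunk of the majority class, and the ensemble score is the plain average of the per-model scores. The names `balanced_subsamples` and `ensemble_score` are hypothetical, and the models are stand-in callables rather than the authors' random forests.

```python
import random
from typing import Callable, List, Sequence


def balanced_subsamples(labels: Sequence[int], seed: int = 0) -> List[List[int]]:
    """Return lists of row indices, each with equal counts of class 0 and class 1.

    Assumption (not stated in the abstract): the minority class is reused in
    every sub-sample, and the majority class is split into disjoint chunks.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if not minority:
        raise ValueError("both classes must be present")

    rng.shuffle(majority)
    k = len(minority)
    subsamples = []
    # Walk the shuffled majority class in chunks of minority size; a trailing
    # chunk smaller than k is dropped so every sub-sample stays balanced.
    for start in range(0, len(majority) - k + 1, k):
        subsamples.append(minority + majority[start:start + k])
    return subsamples


def ensemble_score(models: Sequence[Callable[[object], float]], x: object) -> float:
    """Average the positive-class scores of the models trained on each sub-sample."""
    return sum(model(x) for model in models) / len(models)


if __name__ == "__main__":
    # Toy labels: 3 positive cases among 13 rows (hypothetical data).
    toy_labels = [1] * 3 + [0] * 10
    for indices in balanced_subsamples(toy_labels, seed=1):
        print(sorted(indices))
```

In practice, one classifier would be fitted on the rows selected by each index list (for example a random forest from a library such as scikit-learn), and `ensemble_score` would then combine their predicted probabilities for a new patient record.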

【 License 】

© Khalilia et al; licensee BioMed Central Ltd. 2011. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
