BioData Mining
Classification of breast cancer recurrence based on imputed data: a simulation study
Research
Rahibu A. Abassi [1], Amina S. Msengwa [2]
[1] Department of Natural Sciences, State University of Zanzibar, Zanzibar, Tanzania; [2] Department of Statistics, University of Dar es Salaam, Dar es Salaam, Tanzania
Keywords: Classification accuracy; Imputed data; Missing data mechanisms; Missingness percentages; Simulation
DOI: 10.1186/s13040-022-00316-8
Received 2022-07-06, accepted 2022-11-23, published 2022
Source: Springer
【 Abstract 】

Several studies have classified various real-life events, but few are in medical fields, particularly on breast cancer recurrence using statistical techniques. To our knowledge, no reported study compares statistical classification accuracy and classifiers' discriminative ability for breast cancer recurrence in the presence of imputed missing data. This article aims to fill that gap by comparing the performance of binary classifiers (logistic regression, linear discriminant analysis and quadratic discriminant analysis) on several datasets resulting from imputation under various simulation conditions. Our study adds to the knowledge of how imputed numerical missing data affect classifiers' accuracy and discriminative ability when classifying a binary outcome variable. We simulated incomplete datasets with 15, 30, 45 and 60% missingness under Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms. Mean imputation, hot deck, k-nearest neighbour, multiple imputation by chained equations, expectation-maximisation, and predictive mean matching were used to impute the incomplete datasets. For each classifier, classification accuracy and area under the Receiver Operating Characteristic (ROC) curve were compared under the MAR and MCAR mechanisms. The linear discriminant classifier attained the highest classification accuracy (73.9%) on mean-imputed data at 45% missingness under the MCAR mechanism. Logistic regression on data imputed by predictive mean matching yielded the greatest area under the ROC curve (0.6418) at 30% missingness, while k-nearest neighbour imputation gave the top value (0.6428) at 60% missingness under the MCAR mechanism.
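
The difference between the two missingness mechanisms named in the abstract, and the simplest of the listed imputation methods, can be illustrated with a small sketch. It is not the authors' code: the data are synthetic Gaussian values, the function names (make_mcar, make_mar, mean_impute) are hypothetical, and the MAR rule used here (rows at or above the median of an observed "driver" column are three times as likely to be missing as the rest) is only one assumed way to make missingness depend on observed data.

```python
import random
from statistics import mean


def make_mcar(rows, col, rate, rng):
    """MCAR: every row loses column `col` with the same probability `rate`."""
    return [
        [None if (j == col and rng.random() < rate) else v for j, v in enumerate(row)]
        for row in rows
    ]


def make_mar(rows, col, driver, rate, rng):
    """MAR (assumed rule): the chance of losing `col` depends only on the
    fully observed `driver` column, never on the missing value itself."""
    ranked = sorted(row[driver] for row in rows)
    median = ranked[len(ranked) // 2]
    out = []
    for row in rows:
        # 1.5*rate at or above the driver median, 0.5*rate below it,
        # so the overall missing fraction stays close to `rate`.
        p = rate * (1.5 if row[driver] >= median else 0.5)
        new_row = list(row)
        if rng.random() < p:
            new_row[col] = None
        out.append(new_row)
    return out


def mean_impute(rows, col):
    """Replace missing entries in `col` with the mean of the observed entries."""
    observed = [row[col] for row in rows if row[col] is not None]
    fill = mean(observed)  # raises StatisticsError if nothing was observed
    return [
        [fill if (j == col and v is None) else v for j, v in enumerate(row)]
        for row in rows
    ]


if __name__ == "__main__":
    rng = random.Random(2022)
    # Synthetic two-column numeric data; column 1 plays the observed driver.
    data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(1000)]
    for rate in (0.15, 0.30, 0.45, 0.60):  # missingness levels from the abstract
        for name, holed in (
            ("MCAR", make_mcar(data, 0, rate, rng)),
            ("MAR", make_mar(data, 0, 1, rate, rng)),
        ):
            frac = sum(r[0] is None for r in holed) / len(holed)
            filled = mean_impute(holed, 0)
            print(name, rate, round(frac, 3), round(mean(r[0] for r in filled), 3))
```

In a fuller replication each imputed dataset would then be passed to the three classifiers and scored on accuracy and ROC area; that step is left out here to keep the sketch free of third-party dependencies.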

【 License 】

CC BY   
© The Author(s) 2022
