Electronics
Identification of Secondary Breast Cancer in Vital Organs through the Integration of Machine Learning and Microarrays
Ahmad Almogren [1], Faisal Riaz [2], Fazeel Abid [2], Ikram Ud Din [3], Byung-Seo Kim [4], Shajara Ul Durar [5]
[1] Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11633, Saudi Arabia
[2] Department of Information Systems, University of Management and Technology, Lahore 54770, Pakistan
[3] Department of Information Technology, The University of Haripur, Haripur 22620, Pakistan
[4] Department of Software and Communications Engineering, Hongik University, Sejong 30016, Korea
[5] Management and Organizational Behaviour Business School, University for the Creative Arts, Epsom KT18 5BE, UK
Keywords: metastasis; microarray; gene expression omnibus; decision trees; random forest; K-nearest neighbours
DOI: 10.3390/electronics11121879
Source: DOAJ
Abstract
Breast cancer is the most prevalent malignancy in women, and both genetic and environmental factors contribute to its pathogenesis and progression. Breast cancer metastasizes to the bones, liver, brain, and lungs, and this metastasis is the main cause of death in patients. Feature selection and classification are central to microarray data analysis, but both are highly time-consuming. To address these issues, this research integrates machine learning and microarrays to identify secondary breast cancer in vital organs. The work first imputes missing values using K-nearest neighbors and improves recursive feature elimination with cross-validation (RFECV) by using the random forest method. Second, class imbalance is handled with the K-means synthetic minority oversampling technique (K-means SMOTE), which balances the minority classes while limiting noise. We identified the 16 most essential Entrez Gene IDs for predicting metastatic locations in the bones, brain, liver, and lungs. Extensive experiments were conducted on the NCBI Gene Expression Omnibus datasets GSE14020 and GSE54323. The proposed methods handle class imbalance, limit noise, and reduce time consumption. Reliable results were obtained with four classification models: decision tree, K-nearest neighbors, random forest, and support vector machine. Results are reported using confusion matrices, accuracy, ROC-AUC, PR-AUC, and F1-score.
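The first two steps described in the abstract, K-nearest-neighbor imputation followed by RFECV driven by a random forest, can be illustrated with a minimal sketch. It assumes scikit-learn and uses a small synthetic matrix as a stand-in for expression data. The sample counts, the 5% missing rate, `n_neighbors=5`, the step size, and the fold count are illustrative assumptions, not the authors' settings, and the K-means SMOTE balancing step is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.impute import KNNImputer
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a samples-by-genes expression matrix with four
# classes (e.g. four metastatic sites). This is NOT the GSE14020/GSE54323 data.
X, y = make_classification(
    n_samples=120, n_features=40, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Blank out roughly 5% of the entries to mimic missing probe values
# (the missing rate is an assumption for illustration).
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.05] = np.nan

# Step 1: fill each missing value from the 5 most similar samples.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)

# Step 2: recursive feature elimination with cross-validation, ranking
# features by random forest importances and dropping 4 per iteration.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=4,
    min_features_to_select=4,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="f1_macro",
)
selector.fit(X_imputed, y)

# Column indices of the retained features (gene IDs in a real dataset).
selected = np.flatnonzero(selector.support_)
print(f"kept {selector.n_features_} of {X.shape[1]} features:", selected)
```

In a full pipeline, the reduced matrix `selector.transform(X_imputed)` would then be balanced and passed to the classifiers. Oversampling should be applied only to training folds so that synthetic samples do not leak into evaluation.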
License
Unknown