BMC Bioinformatics
Breast cancer prediction using genome wide single nucleotide polymorphism data
Research
Farzad Sangi1  Babak Damavandi1  Mohsen Hajiloo1  Metanat HooshSadat1  Russell Greiner1  Sambasivarao Damaraju2  Carol E Cass3  John R Mackey3
[1] Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada; Alberta Innovates Centre for Machine Learning, University of Alberta, Edmonton, Alberta, Canada; Department of Laboratory Medicine and Pathology, University of Alberta, Edmonton, Alberta, Canada; PolyomX Program, Cross Cancer Institute, Alberta Health Services, Edmonton, Alberta, Canada; Department of Oncology, University of Alberta, Edmonton, Canada
Keywords: Breast Cancer; Feature Selection; Sporadic Breast Cancer; Gail Model; Breast Cancer Dataset
DOI: 10.1186/1471-2105-14-S13-S3
Source: Springer
【 Abstract 】
Background: This paper introduces a genome wide predictive study and applies it to learn a model that predicts, from a subject's SNP profile, whether she will develop breast cancer.

Results: We first genotyped 696 female subjects (348 breast cancer cases and 348 apparently healthy controls), predominantly of Caucasian origin, from Alberta, Canada, using Affymetrix Human SNP 6.0 arrays. We then applied the EIGENSTRAT population stratification correction method to remove 73 subjects not belonging to the Caucasian population. Next, we filtered out any SNP that had any missing calls, whose genotype frequencies deviated from Hardy-Weinberg equilibrium, or whose minor allele frequency was less than 5%. Finally, we applied a combination of the MeanDiff feature selection method and the k-nearest neighbors (KNN) learning method to this filtered dataset to produce a breast cancer prediction model. The leave-one-out cross-validation (LOOCV) accuracy of this classifier is 59.55%. Random permutation tests show that this result is significantly better than the baseline accuracy of 51.52%. Sensitivity analysis shows that the classifier is fairly robust to the number of MeanDiff-selected SNPs. External validation on the CGEMS breast cancer dataset, the only other publicly available breast cancer dataset, shows that this combination of MeanDiff and KNN leads to a LOOCV accuracy of 60.25%, which is significantly better than its baseline of 50.06%. We then considered a dozen different combinations of feature selection and learning methods, but found that none of them produced a better predictive model than ours. We also considered various biological feature selection methods to produce predictive models, such as selecting SNPs reported in recent genome wide association studies to be associated with breast cancer, selecting SNPs in genes associated with KEGG cancer pathways, or selecting SNPs associated with breast cancer in the F-SNP database, but again found that none of these models achieved accuracy better than baseline.

Conclusions: We anticipate producing more accurate breast cancer prediction models by recruiting more study subjects, providing more accurate labelling of phenotypes (to accommodate the heterogeneity of breast cancer), measuring other genomic alterations such as point mutations and copy number variations, and incorporating non-genetic information about subjects such as environmental and lifestyle factors.
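The pipeline described in the Results (SNP quality filtering, MeanDiff feature selection, KNN, LOOCV) can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not state: genotypes are coded 0/1/2 with `np.nan` for a missing call; MeanDiff is taken to rank SNPs by the absolute difference between the mean genotype value of cases and of controls; KNN uses Manhattan distance; the number of selected SNPs `m` and the neighbor count `k` are free parameters; and feature selection is repeated inside every LOOCV fold so that the held-out subject never influences which SNPs are chosen. The Hardy-Weinberg filter is omitted for brevity, and all function names are hypothetical.

```python
import numpy as np

def qc_filter(G, maf_min=0.05):
    """Boolean mask of SNP columns to keep.

    G: (subjects x SNPs) genotype matrix coded 0/1/2, np.nan = missing call
    (assumed coding). Keeps SNPs with no missing calls and minor allele
    frequency >= maf_min. The Hardy-Weinberg filter is not implemented here.
    """
    G = np.asarray(G, dtype=float)
    complete = ~np.isnan(G).any(axis=0)
    freq = np.zeros(G.shape[1])
    # Frequency of the counted allele, computed only on complete columns.
    freq[complete] = G[:, complete].mean(axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    return complete & (maf >= maf_min)

def mean_diff_select(X, y, m):
    """Indices of the m SNPs with the largest |mean(cases) - mean(controls)|.

    This definition of MeanDiff is an assumption made for illustration.
    """
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[::-1][:m]

def knn_predict(X_train, y_train, x, k):
    """Majority vote (1 = case, 0 = control) among the k nearest training subjects."""
    dist = np.abs(X_train - x).sum(axis=1)        # Manhattan distance (assumed)
    nearest = np.argsort(dist, kind="stable")[:k]
    return int(2 * y_train[nearest].sum() > k)    # use odd k to avoid ties

def loocv_accuracy(X, y, m, k):
    """Leave-one-out accuracy, re-selecting SNPs within each fold."""
    n = len(y)
    correct = 0
    for i in range(n):
        train = np.arange(n) != i                 # hold out subject i
        cols = mean_diff_select(X[train], y[train], m)
        pred = knn_predict(X[train][:, cols], y[train], X[i, cols], k)
        correct += int(pred == y[i])
    return correct / n
```

A typical call, under the same assumptions, would be `keep = qc_filter(G)` followed by `loocv_accuracy(G[:, keep], y, m, k)` with `y` a 0/1 NumPy array of control/case labels. A permutation baseline like the one mentioned in the Results could be estimated by repeating the second call on shuffled copies of `y`.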
【 License 】
Unknown
© Hajiloo et al; licensee BioMed Central Ltd. 2013. This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.