Journal article details
BioData Mining
Evaluation of different approaches for missing data imputation on features associated to genomic data
Hugo Naya [1], Lucía Spangenberg [2], Ben Omega Petrazzini [3], Gustavo Vazquez [4], Fernando Lopez-Bello [5]
[1] Bioinformatics Unit, Institut Pasteur de Montevideo, Mataojo 2020, 11400, Montevideo, Uruguay;Departamento de Producción Animal y Pasturas, Facultad de Agronomía, Universidad de la República, 12900, Montevideo, Uruguay;Bioinformatics Unit, Institut Pasteur de Montevideo, Mataojo 2020, 11400, Montevideo, Uruguay;Department of Informatics and Computer Science, Universidad Católica del Uruguay, Av. 8 de Octubre, 2738, 11600, Montevideo, Uruguay;Bioinformatics Unit, Institut Pasteur de Montevideo, Mataojo 2020, 11400, Montevideo, Uruguay;The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA;Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA;Department of Informatics and Computer Science, Universidad Católica del Uruguay, Av. 8 de Octubre, 2738, 11600, Montevideo, Uruguay;PEDECIBA Bioinformática, Universidad de la República, Montevideo, Uruguay;
Keywords: machine learning; imputation; missing data; genomics; pathogenic variants
DOI: 10.1186/s13040-021-00274-7
Source: Springer
【 Abstract 】

Background: Missing data is a common issue in many fields, such as electronics, image processing, medical records and genomics. It can limit or even bias downstream analysis. The data collection process can lead to different distributions, frequencies, and structures of missing data points, which can be classified into four categories: Structurally Missing Data (SMD), Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). For the latter three, and in the context of genomic data (especially non-coding data), we discuss six imputation approaches using 31,245 variants collected from ClinVar and annotated with 13 genome-wide features.

Results: Random Forest and kNN algorithms showed the best performance on the evaluated dataset. Additionally, some features show robust imputation regardless of the algorithm (e.g. the conservation scores phyloP7 and phyloP20), while other features show poor imputation across algorithms (e.g. PhastCons). We also developed an R package that helps to test which imputation method is best for a particular data set.

Conclusions: We found that Random Forest and kNN are the best imputation methods for genomic data, including non-coding variants. Since Random Forest is computationally more demanding, kNN remains the more practical approach. Future work on variant prioritization through genomic screening tests could benefit considerably from this methodology.
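
The evaluation idea summarized in the abstract (hide known values, impute them, and compare against the truth) can be illustrated with a minimal Python sketch. This is not the authors' R package or their pipeline: the data are synthetic rather than ClinVar variants, only MCAR masking is simulated, and every name here (`mask_mcar`, `knn_impute`, `mean_impute`, `rmse`) is hypothetical and chosen for illustration. The kNN variant shown is a simple one that averages the k nearest rows, with distance computed over the columns both rows have observed.

```python
import math
import random


def mask_mcar(rows, rate, rng):
    """Hide each cell independently with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def column_means(rows):
    """Mean of the observed values in each column (0.0 if none observed)."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def mean_impute(rows):
    """Baseline: replace every missing cell with its column mean."""
    means = column_means(rows)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def row_distance(a, b):
    """Root mean squared difference over columns observed in both rows."""
    sq = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum(sq) / len(sq)) if sq else math.inf


def knn_impute(rows, k=5):
    """Fill each missing cell with the mean of that column in the k nearest rows."""
    means = column_means(rows)  # fallback when no usable neighbour exists
    filled = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, v in enumerate(row):
            if v is not None:
                continue
            donors = sorted(
                (row_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [val for dist, val in donors[:k] if dist != math.inf]
            new_row[j] = sum(nearest) / len(nearest) if nearest else means[j]
        filled.append(new_row)
    return filled


def rmse(truth, masked, imputed):
    """Root mean squared error, measured only on the cells that were hidden."""
    errs = [
        (t - p) ** 2
        for t_row, m_row, p_row in zip(truth, masked, imputed)
        for t, m, p in zip(t_row, m_row, p_row)
        if m is None
    ]
    return math.sqrt(sum(errs) / len(errs)) if errs else 0.0


# Synthetic, correlated features (an assumption standing in for real annotations).
rng = random.Random(0)
truth = []
for _ in range(200):
    x = rng.gauss(0, 1)
    truth.append([x + rng.gauss(0, 0.1), 2 * x + rng.gauss(0, 0.1), -x + rng.gauss(0, 0.1)])

masked = mask_mcar(truth, 0.1, rng)
mean_filled = mean_impute(masked)
knn_filled = knn_impute(masked, k=5)
mean_rmse = rmse(truth, masked, mean_filled)
knn_rmse = rmse(truth, masked, knn_filled)
print(f"mean imputation RMSE: {mean_rmse:.3f}")
print(f"kNN imputation RMSE:  {knn_rmse:.3f}")
```

To approximate MAR or MNAR instead, the masking function could make the hiding probability depend on another column or on the hidden value itself; the comparison step would stay the same.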

【 License 】

CC BY   
