BMC Proceedings
A ν-support vector regression based approach for predicting imputation quality
Proceedings
Jay A Tischfield¹, John P Rice², Scott F Saccone², Yigal Arens³, José Luis Ambite³, Yi-Hung Huang⁴, Chun-Nan Hsu⁵
[1] Department of Genetics, Rutgers University, Piscataway, New Jersey, USA; Department of Psychiatry, Washington University, St. Louis, Missouri, USA; Information Science Institute, University of Southern California, Marina del Rey, California, USA; Institute of Information Science, Academia Sinica, Taipei 115, Taiwan; Department of Computer Science, National Taiwan University, Taipei 106, Taiwan; Institute of Information Science, Academia Sinica, Taipei 115, Taiwan; Information Science Institute, University of Southern California, Marina del Rey, California, USA
Keywords: Reference Panel; Imputation Accuracy; Impute SNPs; True Genotype; Lung Cancer Sample
DOI: 10.1186/1753-6561-6-S7-S3
Source: Springer
【 Abstract 】
Background: Decades of genome-wide association studies (GWAS) have accumulated large volumes of genomic data that could be reused to increase the statistical power of new studies. However, as biotechnology has evolved, different genotyping platforms with different marker sets have been used, which prevents old and new data from being pooled and compared. For example, to pool data collected with 550K chips with newer data collected with 900K chips, the missing loci must be imputed. Many imputation algorithms have been developed, but the posterior probabilities they estimate are not a reliable measure of imputation quality. Recently, many studies have used an imputation quality score (IQS) to measure imputation quality. Computing the IQS, however, requires knowing the true alleles. When the true alleles are unknown, a previously estimated IQS can be reused only if the population and the imputed loci are identical.

Methods: Here, we present a regression model that estimates the IQS by learning from the imputation of loci with known alleles. We designed a small set of features for predicting the IQS, such as minor allele frequency and distance to the nearest known cross-over hotspot. We evaluated our regression models by estimating the IQS of imputations produced by BEAGLE on a set of GWAS data from the NCBI GEO database, collected from samples from different ethnic populations.

Results: We constructed a ν-SVR based approach as our regression model. Our evaluation shows that this regression model achieves a mean squared error below 0.02 and a correlation coefficient close to 0.75 in different imputation scenarios. We also show how the regression results can help remove false positives in association studies.

Conclusion: Reliable estimation of the IQS will facilitate integration and reuse of existing genomic data for meta-analysis and secondary analysis. Experiments show that it is possible to regress the IQS from a small number of features by learning from training examples of imputation and IQS pairs.
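
The abstract does not define the IQS, so the following is only a minimal sketch of why the score needs true genotypes. It assumes the IQS is a chance-corrected concordance of the Cohen's kappa form, (Po - Pc) / (1 - Pc), computed here from best-guess genotype calls rather than posterior probabilities. The function name `imputation_quality_score` and the genotype labels are hypothetical and chosen for illustration.

```python
from collections import Counter


def imputation_quality_score(true_genotypes, imputed_genotypes):
    """Kappa-style chance-corrected concordance between true and imputed calls.

    Assumption: IQS = (Po - Pc) / (1 - Pc), where Po is the observed
    concordance and Pc is the concordance expected by chance. This sketch
    uses best-guess genotypes only.
    """
    if len(true_genotypes) != len(imputed_genotypes) or not true_genotypes:
        raise ValueError("need two non-empty sequences of equal length")

    n = len(true_genotypes)

    # Po: fraction of samples whose imputed genotype matches the true one.
    observed = sum(t == i for t, i in zip(true_genotypes, imputed_genotypes)) / n

    # Pc: agreement expected by chance, from the marginal genotype counts.
    true_counts = Counter(true_genotypes)
    imputed_counts = Counter(imputed_genotypes)
    chance = sum(true_counts[g] * imputed_counts[g] for g in true_counts) / (n * n)

    if chance == 1.0:
        # Both sets contain a single identical genotype: the score is undefined.
        return float("nan")
    return (observed - chance) / (1.0 - chance)


# A rare-variant SNP where the imputation always calls the common genotype:
# raw concordance is 0.9, but the chance-corrected score is 0.
rare_true = ["AA"] * 9 + ["AB"]
rare_imputed = ["AA"] * 10
rare_score = imputation_quality_score(rare_true, rare_imputed)

# A perfectly imputed SNP scores 1.
perfect = ["AA", "AB", "BB", "AB"]
perfect_score = imputation_quality_score(perfect, perfect)
```

Because the score is computed against `true_genotypes`, it cannot be evaluated directly for loci whose alleles are unknown, which is the gap the regression model described above is meant to fill.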
【 License 】
CC BY
© Huang et al.; licensee BioMed Central Ltd. 2012