| BMC Research Notes | |
| Molgenis-impute: imputation pipeline in a box | |
| Morris A Swertz [1], Martijn Dijkstra [1], Heorhiy Byelas [1], Freerk van Dijk [1], Patrick Deelen [1], Alexandros Kanterakis [1] | |
| [1] Department of Genetics, Genomics Coordination Center, University Medical Center Groningen and University of Groningen, Genetics, UMCG, Groningen, 9700 RB, The Netherlands | |
| Keywords: GWAS; Genotyping; Imputation | |
| Others : 1230877 DOI : 10.1186/s13104-015-1309-3 |
|
| Received 2014-07-07, accepted 2015-07-30, published 2015 | |
【 Abstract 】
Background
Genotype imputation is an important procedure in current genomic analyses such as genome-wide association studies, meta-analyses and fine mapping. Although high-quality tools are available for the individual steps of this process, considerable effort and expertise are required to set up and run a best-practice imputation pipeline, particularly for larger genotype datasets, where imputation has to scale out in parallel on computer clusters.
Results
Here we present MOLGENIS-impute, an ‘imputation in a box’ solution that seamlessly and transparently automates the setup and running of all the steps of the imputation process. These steps include genome build liftover, genotype phasing with SHAPEIT2, quality control, sample and chromosomal chunking/merging, and imputation with IMPUTE2. MOLGENIS-impute builds on MOLGENIS-compute, a simple pipeline management platform for submitting and monitoring bioinformatics tasks in High Performance Computing (HPC) environments such as local/cloud servers, clusters and grids. All the required tools, data and scripts are downloaded and installed in a single step. Researchers with diverse backgrounds and expertise have tested MOLGENIS-impute at different sites and have imputed over 30,000 samples so far, using the 1000 Genomes Project and the new Genome of the Netherlands data as imputation references. The tests were performed on PBS/SGE clusters, cloud VMs and a grid HPC environment.
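The sample and chromosomal chunking step mentioned above can be illustrated with a minimal Python sketch. This is not the MOLGENIS-impute implementation: the function names (`chromosome_chunks`, `sample_batches`, `plan_jobs`), the 5 Mb default window and the batch sizes are assumptions chosen for illustration only. The idea is that each (region, sample batch) pair becomes an independent job that a cluster scheduler could run in parallel, with the outputs merged afterwards.

```python
from typing import Iterator, List, Tuple


def chromosome_chunks(chrom_length: int,
                      chunk_size: int = 5_000_000) -> Iterator[Tuple[int, int]]:
    """Yield 1-based, inclusive (start, end) windows that cover a chromosome.

    The 5 Mb default is an illustrative assumption, not a value taken
    from MOLGENIS-impute.
    """
    if chrom_length <= 0 or chunk_size <= 0:
        raise ValueError("chrom_length and chunk_size must be positive")
    start = 1
    while start <= chrom_length:
        end = min(start + chunk_size - 1, chrom_length)
        yield start, end
        start = end + 1


def sample_batches(sample_ids: List[str], batch_size: int) -> List[List[str]]:
    """Split the sample list into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [sample_ids[i:i + batch_size]
            for i in range(0, len(sample_ids), batch_size)]


def plan_jobs(chrom_length: int, sample_ids: List[str],
              chunk_size: int, batch_size: int) -> List[Tuple[int, int, List[str]]]:
    """Build one (start, end, samples) job per region and sample batch."""
    return [(start, end, batch)
            for batch in sample_batches(sample_ids, batch_size)
            for start, end in chromosome_chunks(chrom_length, chunk_size)]


if __name__ == "__main__":
    # Hypothetical 12 Mb chromosome and three hypothetical sample IDs.
    for job in plan_jobs(12_000_000, ["s1", "s2", "s3"], 5_000_000, 2):
        print(job)
```

With these hypothetical inputs the plan contains six jobs: three regions for each of two sample batches. In a real run the window size would be chosen to match the guidance of the imputation tool and the memory available on each cluster node.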
Conclusions
MOLGENIS-impute prioritizes ease of setting up, configuring and running an imputation. It has minimal dependencies and wraps the pipeline in a simple command line interface, without sacrificing the flexibility to adapt the pipeline or limiting the options of the underlying imputation tools. It does not require knowledge of a workflow system or programming, and is targeted at researchers who just want to apply best practices in imputation via simple commands. Because it is built on the MOLGENIS-compute workflow framework, it can be customized with additional computational steps or included in other bioinformatics pipelines. It is available as open source from: https://github.com/molgenis/molgenis-imputation.
【 License 】
2015 Kanterakis et al.