Journal article details
GigaScience
Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data
Ola Spjuth [1], Jonas Hagberg [3], Pall I Olason [2], Martin Dahlö [1], Samuel Lampa [3]
[1] Science for Life Laboratory, Uppsala University, Husargatan 3, SE-751 23, Uppsala, Sweden;Evolutionary Biology Centre, Uppsala University, Norbyvägen 18D, SE-752 36, Uppsala, Sweden;SNIC-UPPMAX, Uppsala University, PO Box 337, SE-751 05, Uppsala, Sweden
Keywords: Data analysis; Genomics; Bioinformatics; High-performance computing; Infrastructure; Next-generation sequencing
DOI: 10.1186/2047-217X-2-9
Received 2012-11-20; accepted 2013-06-01; published 2013
【 Abstract 】

Analyzing and storing data and results from next-generation sequencing (NGS) experiments is a challenging task, hampered by ever-increasing data volumes and frequent updates of analysis methods and tools. Storage and computation requirements have grown beyond the capacity of personal computers, and there is a need for suitable e-infrastructures for processing. Here we describe UPPNEX, an implementation of such an infrastructure, tailored to the needs of data storage and analysis of NGS data in Sweden, serving various labs and multiple instruments from the major sequencing technology platforms. UPPNEX comprises resources for high-performance computing, large-scale and high-availability storage, an extensive bioinformatics software suite, up-to-date reference genomes and annotations, a support function with system and application experts, as well as a web portal and support ticket system. UPPNEX applications are numerous and diverse, and include whole-genome, de novo, and exome sequencing, targeted resequencing, SNP discovery, RNA-Seq, and methylation analysis. More than 300 projects use UPPNEX, including large undertakings such as the sequencing of the flycatcher and Norway spruce. We describe the strategic decisions made when investing in hardware, setting up maintenance and support, and allocating resources, and illustrate major challenges such as managing data growth. We conclude by summarizing our experiences and observations with UPPNEX to date, providing insights into the successful and less successful decisions made.

【 License 】

   
2013 Lampa et al.; licensee BioMed Central Ltd.
