GigaScience
Tentacle: distributed quantification of genes in metagenomes
Erik Kristiansson [1], Anders Sjögren [1], Fredrik Boulund [1]
[1] Division of Statistics, Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
Keywords: DNA sequencing; Read mapping; DNA sequence analysis; Gene quantification; Metagenomics; Next-generation sequencing; Master-worker; Distributed computing
DOI: 10.1186/s13742-015-0078-1
Received 2015-02-04; accepted 2015-08-05; published 2015
【 Abstract 】

Background

In metagenomics, microbial communities are sequenced at increasingly high resolution, generating datasets with billions of DNA fragments. Novel methods that can efficiently process the growing volumes of sequence data are necessary for the accurate analysis and interpretation of existing and upcoming metagenomes.

Findings

Here we present Tentacle, a novel framework that uses distributed computational resources for gene quantification in metagenomes. Tentacle is implemented using a dynamic master-worker approach in which DNA fragments are streamed via a network and processed in parallel on worker nodes. Tentacle is modular and extensible, and comes with support for six commonly used sequence aligners. It is easy to adapt to different applications in metagenomics and to integrate into existing workflows.
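
The dynamic master-worker idea can be illustrated with a minimal sketch. This is not Tentacle's code: it assumes threads in a single process as a stand-in for networked worker nodes, an exact substring match as a stand-in for a real sequence aligner, and a hypothetical two-gene reference. The names `REFERENCE`, `map_read`, `worker`, and `quantify` are invented for illustration.

```python
import queue
import threading
from collections import Counter

# Hypothetical toy reference (gene name -> sequence); not from the paper.
REFERENCE = {"geneA": "ACGTACGTAA", "geneB": "TTGGCCAATT"}


def map_read(read):
    """Toy stand-in for an aligner: exact substring match against each gene."""
    return [name for name, seq in REFERENCE.items() if read in seq]


def worker(tasks, results):
    """Pull chunks of reads until a None sentinel arrives; report per-chunk counts."""
    while True:
        chunk = tasks.get()
        if chunk is None:
            break
        counts = Counter()
        for read in chunk:
            counts.update(map_read(read))
        results.put(counts)


def quantify(reads, n_workers=3, chunk_size=2):
    """Master: split reads into chunks, let idle workers pull them, merge counts."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    n_chunks = 0
    for i in range(0, len(reads), chunk_size):
        tasks.put(reads[i:i + chunk_size])
        n_chunks += 1
    for _ in threads:
        tasks.put(None)  # one stop sentinel per worker
    total = Counter()
    for _ in range(n_chunks):
        total.update(results.get())
    for t in threads:
        t.join()
    return dict(total)


if __name__ == "__main__":
    print(quantify(["ACGT", "CGTA", "TTGG", "GGCC", "AAAA"]))
```

Because workers pull the next chunk only when they finish the previous one, faster workers naturally process more chunks, which is what makes the scheduling dynamic rather than a fixed up-front split.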

Conclusions

Evaluations show that Tentacle scales very well with increasing computing resources. We illustrate the versatility of Tentacle on three different use cases. Tentacle is written for Linux in Python 2.7 and is published as open source under the GNU General Public License (v3). Documentation, tutorials, installation instructions, and the source code are freely available online at: http://bioinformatics.math.chalmers.se/tentacle.

【 License 】

   
2015 Boulund et al.
