Journal article

【Abstract】

Background: De novo transcriptome assembly is an important technique for understanding gene expression in non-model organisms. Many de novo assemblers using the de Bruijn graph of a set of the RNA sequences rely on in-memory representation of this graph. However, current methods analyse the complete set of read-derived k-mer sequences at once, resulting in the need for computer hardware with large shared memory.

Results: We introduce a novel approach that clusters k-mers as the first step. The clusters correspond to small sets of gene products, which can be processed quickly to give candidate transcripts. We implement the clustering step using the MapReduce approach for parallelising the analysis of large datasets, which enables the use of compute clusters. The computational task is distributed across the compute system using the industry-standard MPI protocol, and no specialised hardware is required. Using this approach, we have re-implemented the Inchworm module from the widely used Trinity pipeline, and tested the method in the context of the full Trinity pipeline. Validation tests on a range of real datasets show large reductions in the runtime and per-node memory requirements when making use of a compute cluster.

Conclusions: Our study shows that MapReduce-based clustering has great potential for distributing challenging sequencing problems, without loss of accuracy. Although we have focussed on the Trinity package, we propose that such clustering is a useful initial step for other assembly pipelines.
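The clustering idea described in the abstract can be illustrated with a minimal, single-process sketch. This is not the authors' implementation: it assumes plain Python in place of MPI, ignores reverse complements and k-mer abundances, and uses hypothetical function names (`map_phase`, `shuffle`, `reduce_phase`, `cluster_kmers`). It assumes that two k-mers belong to the same cluster when they share a (k-1)-mer, which is one plausible reading of "clustering k-mers", and merges them with a union-find structure in the reduce step.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def map_phase(reads, k):
    """Map: emit (key, k-mer) pairs keyed by the k-mer's (k-1)-mer prefix and suffix."""
    for read in reads:
        for km in kmers(read, k):
            yield km[:-1], km
            yield km[1:], km


def shuffle(pairs):
    """Group values by key, standing in for the MapReduce shuffle step."""
    groups = defaultdict(set)
    for key, value in pairs:
        groups[key].add(value)
    return groups


def reduce_phase(groups):
    """Reduce: merge all k-mers that share a (k-1)-mer, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in groups.values():
        ordered = sorted(members)
        root = find(ordered[0])
        for other in ordered[1:]:
            parent[find(other)] = root

    clusters = defaultdict(set)
    for km in list(parent):
        clusters[find(km)].add(km)
    # Return clusters as sorted lists so the result is deterministic.
    return sorted(sorted(c) for c in clusters.values())


def cluster_kmers(reads, k):
    """Assumed end-to-end helper: map, shuffle, then reduce into clusters."""
    return reduce_phase(shuffle(map_phase(reads, k)))


if __name__ == "__main__":
    for cluster in cluster_kmers(["ACGTAC", "GTACGG", "TTTTTT"], 4):
        print(cluster)
```

In a distributed setting, the map and shuffle steps would presumably be spread across nodes, with each resulting cluster then assembled independently; the abstract does not give those details.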

【License】

CC BY
© The Author(s). 2017

BMC Bioinformatics
K-mer clustering algorithm using a MapReduce framework: application to the parallelization of the Inchworm module of Trinity
Methodology Article
Authors: Kirk E. Jordan, Vipin Sachdeva, Martyn D. Winn, Chang Sik Kim
Affiliations: Computational Science Center, IBM T.J. Watson Research, Cambridge, MA, USA; The Hartree Centre and Scientific Computing Department, STFC Daresbury Laboratory, WA4 4AD, Warrington, UK
Present addresses: Silicon Therapeutics, 300 A Street, Boston, MA, USA; Cancer Research UK Manchester Institute, The University of Manchester, M20 4BX, Manchester, UK
Keywords: MapReduce; De novo sequence assembly; RNA-Seq; Trinity
DOI: 10.1186/s12859-017-1881-8
Received 2017-06-19; accepted 2017-10-26; published 2017
Source: Springer