Journal Article

【Abstract】

Background: Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and in human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically between 1 and 10 GB. This leads to challenges in analyzing, transferring and storing metagenomic data. To overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for processing and lossless compression of metagenomic reads in FASTA and FASTQ format.

Results: MetaCRAM integrates algorithms for taxonomy identification and assembly, and introduces parallel execution methods; furthermore, it enables genome reference selection and CRAM-based compression. MetaCRAM also uses novel reference-based compression methods designed through extensive studies of integer compression techniques and through fitting of empirical distributions of metagenomic read-reference positions. MetaCRAM is a lossless method compatible with standard CRAM formats, and it allows for fast selection of relevant files in the compressed domain by maintaining taxonomy information. The performance of MetaCRAM as a stand-alone compression platform was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, suggesting 2- to 4-fold compression ratio improvements compared to gzip. On average, the compressed file sizes were 2 to 13 percent of the original raw metagenomic file sizes.

Conclusions: We described the first architecture for reference-based, lossless compression of metagenomic data. The proposed compression scheme offers significantly improved compression ratios compared to off-the-shelf methods such as zip programs. Furthermore, it enables running different components in parallel, and it provides the user with taxonomic and assembly information generated during execution of the compression pipeline.

Availability: The MetaCRAM software is freely available at http://web.engr.illinois.edu/~mkim158/metacram.html. The website also contains a README file and other relevant instructions for running the code. Note that running the code requires a minimum of 16 GB of RAM. In addition, a virtual box is set up on a 4 GB RAM machine for users to run a simple demonstration.
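The abstract mentions integer compression of read-reference positions without giving details. As a generic illustration only, and not MetaCRAM's actual scheme, the sketch below shows one common approach: sort the alignment start positions, store successive differences, and encode each difference as a variable-length integer. The function names are hypothetical, and positions are assumed to be non-negative integers; note that sorting discards the original read order.

```python
from typing import Iterable, List


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer using 7 data bits per byte (LEB128 style)."""
    if value < 0:
        raise ValueError("varint encoding requires a non-negative integer")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte of this integer
            return bytes(out)


def compress_positions(positions: Iterable[int]) -> bytes:
    """Sort positions, take successive differences, and varint-encode them."""
    out = bytearray()
    previous = 0
    for pos in sorted(positions):
        out += encode_varint(pos - previous)  # small gaps need fewer bytes
        previous = pos
    return bytes(out)


def decompress_positions(data: bytes) -> List[int]:
    """Invert compress_positions, returning the sorted list of positions."""
    positions: List[int] = []
    current = 0  # running sum of decoded gaps
    value = 0    # gap currently being assembled
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += value
            positions.append(current)
            value = 0
            shift = 0
    return positions


if __name__ == "__main__":
    # Hypothetical alignment start positions on a reference genome.
    starts = [1000000, 1000050, 1000120, 5000000]
    blob = compress_positions(starts)
    print(len(blob), "bytes instead of", 4 * len(starts), "with fixed 32-bit integers")
    assert decompress_positions(blob) == starts
```

Reads that align close together on a reference yield small gaps, which is why fitting the empirical distribution of these gaps, as the abstract describes, can guide the choice of integer code.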

【License】

CC BY
© Kim et al. 2016

BMC Bioinformatics
MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression
Methodology Article
Farzad Farnoud¹ Venugopal V. Veeravalli² Minji Kim² Xiejia Zhang² Olgica Milenkovic² Jonathan G. Ligo²
[1] Department of Electrical Engineering, California Institute of Technology, 91125, Pasadena, USA; [2] Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 61801, Urbana, USA
Keywords: Metagenomics; Genomic compression; Parallel algorithms
DOI: 10.1186/s12859-016-0932-x
Received 2015-10-05, accepted 2016-02-02, published 2016
Source: Springer