期刊论文

【摘要】

Background

Programs based on hash tables and Burrows-Wheeler are very fast for mapping short reads to genomes but have low accuracy in the presence of mismatches and gaps. Such reads can be aligned accurately with the Smith-Waterman algorithm but it can take hours and days to map millions of reads even for bacteria genomes.

Results

We introduce a GPU program called MaxSSmap with the aim of achieving comparable accuracy to Smith-Waterman but with faster runtimes. Similar to most programs MaxSSmap identifies a local region of the genome followed by exact alignment. Instead of using hash tables or Burrows-Wheeler in the first part, MaxSSmap calculates maximum scoring subsequence score between the read and disjoint fragments of the genome in parallel on a GPU and selects the highest scoring fragment for exact alignment. We evaluate MaxSSmap’s accuracy and runtime when mapping simulated Illumina E.coli and human chromosome one reads of different lengths and 10% to 30% mismatches with gaps to the E.coli genome and human chromosome one. We also demonstrate applications on real data by mapping ancient horse DNA reads to modern genomes and unmapped paired reads from NA12878 in 1000 genomes.

Conclusions

We show that MaxSSmap attains comparable high accuracy and low error to fast Smith-Waterman programs yet has much lower runtimes. We show that MaxSSmap can map reads rejected by BWA and NextGenMap with high accuracy and low error much faster than if Smith-Waterman were used. On short read lengths of 36 and 51 both MaxSSmap and Smith-Waterman have lower accuracy compared to at higher lengths. On real data MaxSSmap produces many alignments with high score and mapping quality that are not given by NextGenMap and BWA. The MaxSSmap source code in CUDA and OpenCL is freely available from http://www.cs.njit.edu/usman/MaxSSmap webcite.

【授权许可】

2014 Turki and Roshan; licensee BioMed Central Ltd.

【预览】

附件列表
Files	Size	Format	View
20150221020444539.pdf	441KB	PDF	download
Figure 2.	17KB	Image	download
Figure 1.	35KB	Image	download

【图表】

Figure 1.

Figure 2.

【参考文献】

[1]Lunter G, Goodson M: Stampy: A statistical algorithm for sensitive and fast mapping of illumina sequence reads. Genome Res 2011, 21(6):936-939. doi:10.1101/gr.111120.110
[2]Wang Z, Gerstein M, Snyder M: RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009, 10(1):57-63. doi:10.1093/bioinformatics/btn025
[3]Reynoso V, Putonti C: Mapping short sequencing reads to distant relatives. In Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine. BCB ’11, Chicago, Illinois, USA: ACM; 2011:420-424.
[4]Collins LJ, Biggs PJ, Voelckel C, Joly S: An approach to transcriptome analysis of non-model organisms using short-read sequences. Genome Inf 2008, 21:3-14.
[5]Seabury CM, Bhattarai EK, Taylor JF, Viswanathan GG, Cooper SM, Davis DS, Dowd SE, Lockwood ML, Seabury PM: Genome-wide polymorphism and comparative analyses in the white-tailed deer (odocoileus virginianus): a model for conservation genomics. PLoS ONE 2011, 6(1):15811.
[6]Liang C, Schmid A, Lopez-Sanchez M, Moya A, Gross R, Bernhardt J, Dandekar T: JANE: efficient mapping of prokaryotic ests and variable length sequence reads on related template genomes. BMC Bioinformatics 2009, 10(1):391. BioMed Central Full Text
[7]Au KF, Underwood JG, Lee L, Wong WH: Improving pacbio long read accuracy by short read alignment. PLoS ONE 2012, 7(10):46679. doi:10.1371/journal.pone.0046679
[8]Lieberman TD, Flett KB, Yelin I, Martin TR, McAdam AJ, Priebe GP, Kishony R: Genetic variation of a bacterial pathogen within individuals with cystic fibrosis provides a record of selective pressures. Nat Genet 2014, 46:82-87. doi:10.1038/ng.2848
[9]Prufer K, Stenzel U, Hofreiter M, Paabo S, Kelso J, Green R: Computational challenges in the analysis of ancient dna. Genome Biol 2010, 11(5):47. doi:10.1186/gb-2010-11-5-r47 BioMed Central Full Text
[10]Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G: De novo assembly and genotyping of variants using colored de bruijn graphs. Nat Genet 2012, 44:226-232.
[11]Fonseca NA, Rung J, Brazma A, Marioni JC: Tools for mapping high-throughput sequencing data. Bioinformatics 2012, 28(24):3169-3177.
[12]Hatem A, Bozdag D, Toland A, Catalyurek U: Benchmarking short sequence mapping tools. BMC Bioinformatics 2013, 14(1):184. doi:10.1186/1471-2105-14-184 BioMed Central Full Text
[13]Sedlazeck FJ, Rescheneder P, von Haeseler A: NextGenMap: Fast and accurate read mapping in highly polymorphic genomes. Bioinformatics 2013, 29(21):2790-2791. doi:10.1093/bioinformatics/btt468
[14]Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147(1):195-197.
[15]Bentley J: Programming Pearls. Boston, Massachussets: Addison-Wesley; 1986.
[16]Bates JL, Constable RL: Proofs as programs. ACM Trans Program Lang Syst 1985, 7:113-136.
[17]Zhao M, Lee W-P, Garrison EP, Marth GT: SSW library: An simd smith-waterman c/c++ library for use in genomic applications. PLoS ONE 2013, 8(12):82138.
[18]Liu Y, Maskell D, Schmidt B: CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Res Notes 2009, 2(1):73. doi:10.1186/1756-0500-2-73 BioMed Central Full Text
[19]NVIDIA: CUDA C Programming Guide 4.0. 2011. [http://developer.nvidia.com/cuda-toolkit webcite]
[20]Sanders J, Kandrot E: CUDA by Example: An Introduction to General-Purpose GPU Programming. Boston, Massachussets: Addison-Wesley Professional; 2010.
[21]Kirk DB, Hwu W-mW: Programming Massively Parallel Processors: A Hands-on Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 2010.
[22]Stone JE, Gohara D, Shi G: OpenCL: A parallel programming standard for heterogeneous computing systems. IEEE Des Test 2010, 12(3):66-73. doi:10.1109/MCSE.2010.69
[23]Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. i. accuracy assessment. Genome Res 1998, 8(3):175-185.
[24]Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP: The sequence alignment/map format and samtools. Bioinformatics 2009, 25(16):2078-2079.
[25]NVIDIA: CUDA C Best Practices 4.0. 2011. [http://developer.nvidia.com/cuda-toolkit webcite]
[26]Dagum L, Menon R: OpenMP: An industry-standard API for shared-memory programming. IEEE Comput Sci Eng 1998, 5(1):46-55. [http://dx.doi.org/10.1109/99.660313 webcite]
[27]Li H, Durbin R: Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics 2009, 25(14):1754-1760. doi:10.1093/bioinformatics/btp324
[28]Klus P, Lam S, Lyberg D, Cheung M, Pullan G, McFarlane I, Yeo G, Lam B: BarraCUDA - a fast short read sequence aligner using graphics processing units. BMC Res Notes 2012, 5(1):27. doi:10.1186/1756-0500-5-27 BioMed Central Full Text
[29]Schatz M, Trapnell C, Delcher A, Varshney A: High-throughput sequence alignment using graphics processing units. BMC Bioinformatics 2007, 8(1):474. BioMed Central Full Text
[30]Liu Y, Schmidt B, Maskell DL: CUSHAW: a cuda compatible short read aligner to large genomes based on the burrows–wheeler transform. Bioinformatics 2012, 28(14):1830-1837. doi:10.1093/bioinformatics/bts276
[31]Liu C-M, Wong T, Wu E, Luo R, Yiu S-M, Li Y, Wang B, Yu C, Chu X, Zhao K, Li R, Lam T-W: SOAP3: ultra-fast gpu-based parallel alignment tool for short reads. Bioinformatics 2012, 28(6):878-879.
[32]Orlando L, Ginolhac A, Raghavan M, Vilstrup J, Rasmussen M, Magnussen K, Steinmann KE, Kapranov P, Thompson JF, Zazula G, Froese D, Moltke I, Shapiro B, Hofreiter M, Al-Rasheid KAS, Gilbert MTP, Willerslev E: True single-molecule dna sequencing of a pleistocene horse bone. Genome Res 2011, 21(10):1705-1719. doi:10.1101/gr.122747.111
[33]Schubert M, Ginolhac A, Lindgreen S, Thompson J, AL-Rasheid K, Willerslev E, Krogh A, Orlando L: Improving ancient dna read mapping against modern reference genomes. BMC Genomics 2012, 13(1):178. doi:10.1186/1471-2164-13-178 BioMed Central Full Text

BMC Genomics
MaxSSmap: a GPU program for mapping divergent short reads to genomes with the maximum scoring subsequence

Usman Roshan¹ Turki Turki¹
[1] Department of Computer Science, New Jersey Institute of Technology, GITC 4400, University Heights, Newark, USA
关键词: NGS; GPU; Divergent; Alignment;
Others : 1127610 DOI : 10.1186/1471-2164-15-969

received in 2014-10-01, accepted in 2014-11-06, 发布年份 2014
PDF


	文献评价指标
	下载次数：24次	浏览次数：19次

【 摘 要 】

Background

Results

Conclusions

【 授权许可】

【 预 览 】

【 图 表 】

【 参考文献 】

【摘要】

【授权许可】

【预览】

【图表】

【参考文献】