Journal article details
BMC Genomics
Comparison of mapping algorithms used in high-throughput sequencing: application to Ion Torrent data
David Hot [1], Yves Lemoine [2], Christophe Audebert [1], Ségolène Caboche [1]
[1] PEGASE-Biosciences, Institut Pasteur de Lille, 1 Rue du Professeur Calmette, 59019 Lille, France; Transcriptomics and Applied Genomics, Center of Infection and Immunity of Lille, Inserm U1019, CNRS UMR8204, Institut Pasteur de Lille, Univ Lille Nord de France, Lille, France
Keywords: Read simulator; Mapper comparison; Mapping algorithms; High-throughput sequencing
Others  :  1217533
DOI  :  10.1186/1471-2164-15-264
Received 2013-11-13, accepted 2014-04-01, published 2014
【 Abstract 】

Background

The rapid evolution in high-throughput sequencing (HTS) technologies has opened up new perspectives in several research fields and led to the production of large volumes of sequence data. A fundamental step in HTS data analysis is the mapping of reads onto reference sequences. Choosing a suitable mapper for a given technology and a given application is a subtle task because of the difficulty of evaluating mapping algorithms.

Results

We present a benchmark procedure for comparing HTS mapping algorithms on both real and simulated datasets against four evaluation criteria: computational resource and time requirements, robustness of mapping, ability to report positions for reads in repetitive regions, and ability to retrieve true genetic variation positions. To measure robustness, we introduce a new definition of a correctly mapped read that takes into account not only the expected start position of the read but also its end position and the number of indels and substitutions. We developed CuReSim, a new read simulator that can generate customized benchmark data for any kind of HTS technology by adjusting parameters to the error types. CuReSim and CuReSimEval, a tool for evaluating the mapping quality of CuReSim simulated reads, are freely available. We applied the benchmark procedure to evaluate 14 mappers in the context of whole genome sequencing of small genomes with Ion Torrent data, for which no such comparison had yet been established.
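
The correctness definition described above can be sketched in Python. This is a minimal illustration only, not CuReSimEval's implementation: the `Alignment` record, the `is_correctly_mapped` function, and the `pos_tol` and `edit_tol` tolerances are hypothetical names and parameters, since the abstract does not state how the criteria are combined or which thresholds are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Summary of where a read lies on the reference (hypothetical record)."""
    start: int          # first reference position covered by the read
    end: int            # last reference position covered by the read
    substitutions: int  # number of mismatched bases
    indels: int         # number of inserted plus deleted bases


def is_correctly_mapped(expected: Alignment, observed: Alignment,
                        pos_tol: int = 0, edit_tol: int = 0) -> bool:
    """Return True if the observed alignment agrees with the simulated truth.

    All four quantities named in the abstract are checked: start position,
    end position, substitution count, and indel count. The tolerances are
    assumptions made for this sketch.
    """
    return (abs(observed.start - expected.start) <= pos_tol
            and abs(observed.end - expected.end) <= pos_tol
            and abs(observed.substitutions - expected.substitutions) <= edit_tol
            and abs(observed.indels - expected.indels) <= edit_tol)


# Simulated truth for one read, and a mapper result whose start position is
# right but whose end position is shifted by 6 bases.
truth = Alignment(start=100, end=199, substitutions=2, indels=1)
shifted_end = Alignment(start=100, end=205, substitutions=2, indels=1)

start_only_ok = shifted_end.start == truth.start         # start-only criterion
full_check_ok = is_correctly_mapped(truth, shifted_end)  # stricter criterion
```

In this example a start-only criterion would accept the read, while the stricter check rejects it because the end position disagrees. That is the kind of case the extended definition is meant to catch.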

Conclusions

A benchmark procedure to compare HTS data mappers is introduced with a new definition for the mapping correctness as well as tools to generate simulated reads and evaluate mapping quality. The application of this procedure to Ion Torrent data from the whole genome sequencing of small genomes has allowed us to validate our benchmark procedure and demonstrate that it is helpful for selecting a mapper based on the intended application, questions to be addressed, and the technology used. This benchmark procedure can be used to evaluate existing or in-development mappers as well as to optimize parameters of a chosen mapper for any application and any sequencing platform.

【 License 】

   
2014 Caboche et al.; licensee BioMed Central Ltd.
