Journal article details
BMC Bioinformatics
ViVaMBC: estimating viral sequence variation in complex populations from Illumina deep-sequencing data using model-based clustering
Bie Verbist [3], Lieven Clement [2], Joke Reumers [4], Kim Thys [4], Alexander Vapirev [1], Willem Talloen [4], Yves Wetzels [4], Joris Meys [3], Jeroen Aerssens [4], Luc Bijnens [4], Olivier Thas [5]
[1] ExaScience Life Lab, Kapeldreef 75, Leuven 3001, Belgium
[2] Department of Applied Mathematics, Informatics and Statistics, Ghent University, Krijgslaan 281 S9, Gent 9000, Belgium
[3] Department of Mathematical Modeling, Statistics and Bioinformatics, Ghent University, Coupure Links 653, Gent 9000, Belgium
[4] Janssen R&D, Janssen Pharmaceutical Companies of J&J, Turnhoutseweg 30, Beerse 2340, Belgium
[5] University of Wollongong, National Institute for Applied Statistics Research Australia (NIASRA), School of Mathematics and Applied Statistics, NSW 2522, Australia
Keywords: Viral quasispecies; Model-based clustering; Second best base call; Codon; Illumina sequencing
DOI  :  10.1186/s12859-015-0458-7
Received 2014-07-01, accepted 2014-12-16, published 2015
【 Abstract 】

Background

Deep sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology-associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores, which for Illumina sequencing are derived from a quadruplet of intensities, one channel for each nucleotide type. The highest intensity of the four channels determines the base that is called. Mismatched bases can often be corrected by the second best base, i.e. the base with the second highest intensity in the quadruplet. A virus variant model-based clustering method, ViVaMBC, is presented that uses quality scores and second best base calls for identifying and quantifying viral variants. ViVaMBC is optimized to call variants at the codon level (nucleotide triplets), which enables immediate biological interpretation of the variants with respect to their antiviral drug responses.
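
The ingredients named above can be illustrated with a minimal Python sketch. It is not the ViVaMBC implementation: the channel order `ACGT`, the function names, and the example intensities are assumptions made for illustration, and the quality-to-probability conversion assumes the scores are on the standard Phred scale (Q = -10 log10 p).

```python
BASES = "ACGT"  # assumed channel order, for illustration only


def call_bases(intensities):
    """Return (best, second_best) base for one quadruplet of channel intensities."""
    if len(intensities) != 4:
        raise ValueError("expected one intensity per nucleotide channel")
    # Rank the channels from highest to lowest intensity.
    ranked = sorted(zip(intensities, BASES), reverse=True)
    return ranked[0][1], ranked[1][1]


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def split_codons(seq):
    """Split a nucleotide sequence into complete triplets, dropping any remainder."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# Hypothetical quadruplet: G has the highest intensity, T the second highest.
best, second = call_bases([120.0, 15.5, 980.2, 310.7])
codons = split_codons("ATGGCCTA")
```

Here `call_bases` returns the called base together with the second best base, `phred_to_error_prob(30)` corresponds to an error probability of about 0.001, and `split_codons` shows the triplet grouping on which codon-level calling operates.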

Results

Using mixtures of HCV plasmids, we show that our method accurately estimates frequencies down to 0.5%. The estimates are unbiased when average coverages of 25,000 are reached. A comparison with the SNP callers V-Phaser2, ShoRAH, and LoFreq shows that ViVaMBC has superb sensitivity and specificity for variants with frequencies above 0.4%. Unlike the competitors, however, ViVaMBC reports a higher number of false-positive findings with frequencies below 0.4%, which might partially originate from picking up artificial variants introduced by errors in the sample and library preparation steps.
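
To give a feel for the scale of these numbers, the following Python sketch computes how many reads would be expected to support a variant at a given coverage and frequency, and the sampling standard error of the observed frequency. It assumes independent reads and simple binomial sampling with no sequencing error, which is a simplification and not the model used by ViVaMBC; the function names are hypothetical.

```python
import math


def expected_variant_reads(coverage, frequency):
    """Expected number of reads carrying a variant present at the given frequency."""
    return coverage * frequency


def binomial_se(coverage, frequency):
    """Standard error of the observed frequency under binomial sampling (assumption)."""
    return math.sqrt(frequency * (1 - frequency) / coverage)


# Values taken from the text: coverage 25,000 and a variant frequency of 0.5%.
reads = expected_variant_reads(25_000, 0.005)
se = binomial_se(25_000, 0.005)
```

Under these assumptions, a 0.5% variant at a coverage of 25,000 is supported by about 125 reads, and the sampling standard error of its estimated frequency is roughly 0.045 percentage points.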

Conclusions

ViVaMBC is the first method to call viral variants directly at the codon level. The strength of the approach lies in modeling the error probabilities based on the quality scores. Although the use of second best base calls appeared very promising in our data exploration phase, their utility was limited. They provided a slight increase in sensitivity, which, however, does not warrant the additional computational cost of running the offline base caller. Apparently, a lot of information is already contained in the quality scores, enabling the model-based clustering procedure to adjust for the majority of the sequencing errors. Overall, the sensitivity of ViVaMBC is such that technical constraints like PCR errors start to form the bottleneck for low-frequency variant detection.

【 License 】

   
2015 Verbist et al.; licensee BioMed Central.
