Journal Article Details
BMC Bioinformatics
Empirical estimation of sequencing error rates using smoothing splines
Methodology Article
Bo Peng1  Jian Wang2  Xuan Zhu2  Sanjay Shete3 
[1] Department of Bioinformatics & Computational Biology, The University of Texas MD Anderson Cancer Center, 77030, Houston, TX, USA; Department of Biostatistics, The University of Texas MD Anderson Cancer Center, 77030, Houston, TX, USA; Department of Biostatistics, The University of Texas MD Anderson Cancer Center, 77030, Houston, TX, USA; Department of Epidemiology, The University of Texas MD Anderson Cancer Center, 77030, Houston, TX, USA
Keywords: Empirical error rate; Next-generation sequencing; Smoothing spline; Frequency-based simulation; Short reads
DOI: 10.1186/s12859-016-1052-3
Received 2015-08-18, accepted 2016-04-14, published 2016
Source: Springer
【 Abstract 】

Background: Next-generation sequencing has been used by investigators to address a diverse range of biological problems through, for example, polymorphism and mutation discovery and microRNA profiling. However, compared to conventional sequencing, the error rates for next-generation sequencing are often higher, which impacts the downstream genomic analysis. Recently, Wang et al. (BMC Bioinformatics 13:185, 2012) proposed a shadow regression approach to estimate the error rates for next-generation sequencing data based on the assumption of a linear relationship between the number of reads sequenced and the number of reads containing errors (denoted as shadows). However, this linear read-shadow relationship may not be appropriate for all types of sequence data. Therefore, it is necessary to estimate the error rates in a more reliable way without assuming linearity. We proposed an empirical error rate estimation approach that employs cubic and robust smoothing splines to model the relationship between the number of reads sequenced and the number of shadows.

Results: We performed simulation studies using a frequency-based approach to generate the read and shadow counts directly, which can mimic the real sequence counts data structure. Using simulation, we investigated the performance of the proposed approach and compared it to that of shadow linear regression. The proposed approach provided more accurate error rate estimations than the shadow linear regression approach for all the scenarios tested. We also applied the proposed approach to assess the error rates for the sequence data from the MicroArray Quality Control project, a mutation screening study, the Encyclopedia of DNA Elements project, and bacteriophage PhiX DNA samples.

Conclusions: The proposed empirical error rate estimation approach does not assume a linear relationship between the error-free read and shadow counts and provides more accurate estimations of error rates for next-generation, short-read sequencing data.
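
To make the contrast between a linear read-shadow fit and a smoothing-spline fit concrete, the following is a minimal sketch and not the authors' implementation. It fits a least-squares line and a natural cubic smoothing spline (in the standard Reinsch / Green and Silverman matrix form) to the same toy read and shadow counts. The function names (`solve`, `linear_fit`, `smoothing_spline`, `rss`), the toy counts, and the smoothing parameter `alpha` are all assumptions made for illustration. The robust spline variant and the step that converts a fitted curve into an error rate are not described in the abstract and are not attempted here.

```python
def solve(a, b):
    """Solve the dense linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))  # partial pivoting
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def linear_fit(x, y):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def smoothing_spline(x, y, alpha):
    """Fitted values of the natural cubic smoothing spline that minimises
    sum (y_i - g(x_i))^2 + alpha * integral of g''(t)^2.
    x must be strictly increasing and contain at least three points."""
    n = len(x)
    h = [x[i + 1] - x[i] for i in range(n - 1)]
    k = n - 2  # number of interior knots
    q = [[0.0] * k for _ in range(n)]
    r = [[0.0] * k for _ in range(k)]
    for c in range(k):
        j = c + 1  # index of the interior knot for column c
        q[j - 1][c] = 1.0 / h[j - 1]
        q[j][c] = -1.0 / h[j - 1] - 1.0 / h[j]
        q[j + 1][c] = 1.0 / h[j]
        r[c][c] = (h[j - 1] + h[j]) / 3.0
        if c + 1 < k:
            r[c][c + 1] = r[c + 1][c] = h[j] / 6.0
    qty = [sum(q[i][c] * y[i] for i in range(n)) for c in range(k)]
    a = [[r[c][d] + alpha * sum(q[i][c] * q[i][d] for i in range(n))
          for d in range(k)] for c in range(k)]
    gamma = solve(a, qty)  # second derivatives at the interior knots
    return [y[i] - alpha * sum(q[i][c] * gamma[c] for c in range(k))
            for i in range(n)]


def rss(observed, fitted):
    """Residual sum of squares."""
    return sum((o - f) ** 2 for o, f in zip(observed, fitted))


# Toy data (invented): shadow counts grow faster than linearly with read counts.
reads = [10.0, 20.0, 40.0, 80.0, 160.0, 320.0]
shadows = [0.05 * r + 0.0002 * r * r for r in reads]

slope, intercept = linear_fit(reads, shadows)
line_fit = [intercept + slope * r for r in reads]
spline_fit = smoothing_spline(reads, shadows, alpha=1000.0)  # alpha is arbitrary

rss_line = rss(shadows, line_fit)
rss_spline = rss(shadows, spline_fit)
print("RSS, straight line:   ", rss_line)
print("RSS, smoothing spline:", rss_spline)
```

Because a straight line has zero curvature penalty, the smoothing spline's residual sum of squares can never exceed that of the least-squares line on the same data. When the data are exactly linear the two fits coincide, so the spline loses nothing in the case where the linearity assumption does hold.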

【 License 】

CC BY   
© Zhu et al. 2016
