Journal Article Details
BMC Bioinformatics
HALC: High throughput algorithm for long read error correction
Software
Lingxiao Lan1  Ergude Bao2 
[1]School of Software Engineering, Beijing Jiaotong University, 3 Shangyuan Residence, Haidian District, 100044, Beijing, China
[2]School of Software Engineering, Beijing Jiaotong University, 3 Shangyuan Residence, Haidian District, 100044, Beijing, China
[3]Department of Botany and Plant Sciences, University of California, Riverside, 900 University Ave., 92521, Riverside, CA, USA
Keywords: PacBio long reads; Error correction; Throughput
DOI  :  10.1186/s12859-017-1610-3
Received 2016-10-27, accepted 2017-03-24, published 2017
Source: Springer
【 Abstract 】
Background: The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus lead to low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis.
Results: Here, we introduce HALC, a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement so that a long read region can be aligned to at least one contig region, including its true genome region’s repeats in the contigs sufficiently similar to it (similar repeat based alignment approach). It then constructs a contig graph and, for each long read, references the other long reads’ alignments to find the most accurate alignment and correct it with the aligned contig regions (long read support based validation approach). Even though some long read regions without the true genome regions in the contigs are corrected with their repeats, this approach makes it possible to further refine these long read regions with the initial insufficient short reads and correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data sets, HALC was able to obtain 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms.
Conclusions: The HALC software can be downloaded for free from this site: https://github.com/lanl001/halc.
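The abstract does not define how throughput is measured. As a rough illustration only, the sketch below assumes throughput can be expressed as the fraction of input bases that survive correction; the function name `correction_stats` and the toy reads are hypothetical and are not part of the HALC software.

```python
def correction_stats(raw_reads, corrected_reads):
    """Return (bases_in, bases_out, throughput) for one correction run.

    Assumption: throughput = bases_out / bases_in. This is an
    illustrative definition, not one taken from the abstract.
    """
    bases_in = sum(len(read) for read in raw_reads)
    bases_out = sum(len(read) for read in corrected_reads)
    throughput = bases_out / bases_in if bases_in else 0.0
    return bases_in, bases_out, throughput


# Toy data: a corrector that discards uncorrected bases returns shorter reads.
raw = ["ACGTACGTAC", "GGGTTTAAAC"]   # 20 bases in
trimmed = ["ACGTAC", "GGGTTT"]       # 12 bases kept
print(correction_stats(raw, trimmed))
```

Under this assumed definition, an algorithm that keeps more of each long read after correction scores a higher throughput, which is the property the abstract says matters for downstream assembly completeness.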
【 License 】

CC BY   
© The Author(s) 2017
