BMC Bioinformatics
Repeat-aware modeling and correction of short read errors
Research
Xiao Yang, Srinivas Aluru, Karin S Dorman
Affiliations: Department of Electrical and Computer Engineering, Iowa State University, 50011, Ames, Iowa, USA; Department of Computer Science and Engineering, Indian Institute of Technology Bombay, 400 076, Mumbai, Maharashtra, India; Department of Statistics and Department of Genetics, Development & Cell Biology, Iowa State University, 50011, Ames, Iowa, USA
Keywords: Reference Genome; Error Detection; Short Read; Illumina Genome Analyzer; Repeat Content
DOI: 10.1186/1471-2105-12-S1-S52
Source: Springer
【 Abstract 】

Background: High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications, including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of k-mers in reads and validating those with frequencies exceeding a threshold. In genomes with high repeat content, an erroneous k-mer may be observed frequently if it differs by only a few nucleotides from valid k-mers that occur multiple times in the genome. Error detection and correction have mostly been applied to genomes with low repeat content, and the problem remains challenging for genomes with high repeat content.

Results: We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of k-mers from their observed frequencies by analyzing the misread relationships among observed k-mers. We also propose a method to estimate the threshold used to validate k-mers: a k-mer is accepted when its estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content.

Availability: The software is implemented in C++ and is freely available under the GNU GPL3 license and the Boost Software V1.0 license at “http://aluru-sun.ece.iastate.edu/doku.php?id=redeem”.

Conclusions: We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.
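
The conventional approach described in the Background (count k-mers, then validate those whose observed frequency exceeds a threshold) can be sketched roughly as follows. This is a toy illustration only, not the authors' statistical model or their C++ software; the function names, the toy reads, and the threshold value are all assumptions made for the example. The last helper hints at why repeats are hard: an erroneous k-mer sits within a few substitutions of a valid one, so in a repeat-rich genome it can itself be observed often.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_threshold(counts, threshold):
    """Treat k-mers whose observed count exceeds `threshold` as valid."""
    valid = {kmer for kmer, c in counts.items() if c > threshold}
    suspect = set(counts) - valid
    return valid, suspect


def hamming_neighbors(kmer, counts, max_dist=1):
    """Observed k-mers within `max_dist` substitutions of `kmer`."""
    return [
        other for other in counts
        if other != kmer
        and sum(a != b for a, b in zip(kmer, other)) <= max_dist
    ]


# Toy data (hypothetical): the last read carries one substitution error.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
counts = count_kmers(reads, k=4)
valid, suspect = split_by_threshold(counts, threshold=2)
neighbors_of_error = hamming_neighbors("ACGA", counts)
```

With these toy reads, the k-mers from the three identical reads pass the threshold, while the k-mers introduced by the substitution do not; `ACGA` has the valid k-mer `ACGT` as its single one-substitution neighbor.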

【 License 】

Unknown   
© Yang et al; licensee BioMed Central Ltd. 2011. This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
