Journal Article Details
BMC Bioinformatics
S-conLSH: alignment-free gapped mapping of noisy long reads
Burkhard Morgenstern [1], Angana Chakraborty [2], Sanghamitra Bandyopadhyay [3]
[1] Department of Bioinformatics (IMG), University of Göttingen, 37077, Göttingen, Germany; [2] Department of Computer Science, West Bengal Education Service, Kolkata, India; [3] Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
Keywords: Sequence analysis; Alignment-free sequence comparison; Noisy long SMRT reads; Locality sensitive hashing
DOI: 10.1186/s12859-020-03918-3
Source: Springer
【 Abstract 】

Background: The advancement of SMRT (single-molecule real-time) technology has opened new opportunities for genome analysis through its longer read length and low GC bias. Aligning the reads to their appropriate positions in the respective reference genome is the first, but also the costliest, step of any analysis pipeline based on SMRT sequencing. However, state-of-the-art aligners often fail to identify distant homologies because of a lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate.

Results: We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We examined the performance of the proposed method on five different real and simulated datasets. S-conLSH is at least twice as fast as the recently developed method lordFAST. It achieves a sensitivity of 99% on human simulated sequence data without using any traditional base-to-base alignment. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option to generate aligned output as a SAM file if this is required for downstream processing.

Conclusions: S-conLSH is one of the first alignment-free reference genome mapping tools to achieve a high level of sensitivity. The spaced context is especially suitable for extracting distant similarities. The variable-length spaced seeds, or patterns, add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be regarded as a promising direction for alignment-free sequence analysis.
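
The general idea of mapping with a spaced pattern can be illustrated with a minimal Python sketch. This is not the S-conLSH implementation: the function names, the single pattern `110011`, the exact dictionary lookup and the diagonal voting are all assumptions made for illustration, whereas the abstract describes multiple spaced patterns combined with context-based locality sensitive hashing. The sketch only shows why ignoring "don't care" positions lets a read with a sequencing error still hit the right location.

```python
from collections import defaultdict


def spaced_key(window: str, pattern: str) -> str:
    """Keep only the characters of `window` where `pattern` has a '1'.

    Positions marked '0' are "don't care" positions, so a mismatch
    there does not change the key. (Hypothetical helper, for illustration.)
    """
    if len(window) != len(pattern):
        raise ValueError("window and pattern must have the same length")
    return "".join(c for c, p in zip(window, pattern) if p == "1")


def index_reference(reference: str, pattern: str) -> dict:
    """Map every spaced key of the reference to its start positions."""
    span = len(pattern)
    index = defaultdict(list)
    for pos in range(len(reference) - span + 1):
        index[spaced_key(reference[pos:pos + span], pattern)].append(pos)
    return index


def candidate_positions(read: str, index: dict, pattern: str) -> dict:
    """Vote for reference start positions (diagonals) supported by the read.

    Each read window whose spaced key occurs in the index votes for
    `reference_position - read_offset`. This voting scheme is an assumed
    simplification, not the method described in the paper.
    """
    span = len(pattern)
    votes = defaultdict(int)
    for offset in range(len(read) - span + 1):
        key = spaced_key(read[offset:offset + span], pattern)
        for pos in index.get(key, ()):
            votes[pos - offset] += 1
    return dict(votes)
```

In this sketch, a read window with an error at a `0` position produces the same key as the error-free reference window, so it still contributes a vote for the correct location. No base-to-base alignment is computed at any point.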

【 License 】

CC BY   
