Technical Report

【Abstract】

We determined the most appropriate data structures for storing and comparing n-grams (known in biology as k-mers) in genomic sequence data in a way that scales in both memory and speed. This work is critical to maintaining LLNL as the leader in pathogen detection, as it will guide the design of the 'Next Generation' system for computational signature prediction. We investigated the two parts of k-mer analysis for signature prediction. The first is the enumeration and frequency counting of all observed k-mers in a sequence database. The second is the down-selection and pairing of k-mers to generate a signature. For the first part, we determined that suffix arrays are the preferred method for enumerating k-mers, because they are memory efficient and relatively easy and fast to compute. For the second part, a subset of the k-mers can be stored and manipulated in a hash, with the subset chosen according to desired frequency characteristics, such as the most or least frequent k-mers in a set, k-mers shared among sequence sets, or k-mers that discriminate between sequence sets.
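
The two-part approach described in the abstract can be illustrated with a minimal Python sketch. It is not the report's implementation: the function names (`build_suffix_array`, `enumerate_kmers`, `discriminating_kmers`) and the toy sequences are hypothetical, and the suffix array is built by naive sorting, which would not scale to a real sequence database. The sketch relies on the fact that suffixes sharing the same first k characters sit next to each other in a suffix array, so k-mer frequencies can be read off as run lengths.

```python
def build_suffix_array(text):
    """Return suffix start positions in lexicographic order of the suffixes.

    Assumption: naive construction by sorting whole suffixes, for
    illustration only; large inputs need a dedicated algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def enumerate_kmers(text, k):
    """Part 1: list (k-mer, count) pairs in sorted order via the suffix array."""
    runs = []
    for start in build_suffix_array(text):
        if start + k > len(text):
            continue  # suffix is shorter than k, so it holds no k-mer
        kmer = text[start:start + k]
        if runs and runs[-1][0] == kmer:
            runs[-1][1] += 1  # same k-mer as the previous suffix: extend the run
        else:
            runs.append([kmer, 1])  # a new k-mer starts a new run
    return [(kmer, count) for kmer, count in runs]


def discriminating_kmers(target, background, k, min_count=1):
    """Part 2: keep in a hash (dict) the target k-mers absent from the background."""
    background_set = {kmer for kmer, _ in enumerate_kmers(background, k)}
    return {
        kmer: count
        for kmer, count in enumerate_kmers(target, k)
        if count >= min_count and kmer not in background_set
    }


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    print(enumerate_kmers("ACGTACGA", 3))
    print(discriminating_kmers("ACGTACGA", "TTACGT", 3))
```

Other selection criteria named in the abstract, such as the most frequent k-mers or k-mers shared among sequence sets, would change only the filter condition in the second function.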


Feasibility of N-Gram Data-Structures for Next-Generation Pathogen Signature Design

Gardner, S N
Keywords: DESIGN; DETECTION; FORECASTING; LAWRENCE LIVERMORE NATIONAL LABORATORY; PATHOGENS; STORAGE; VELOCITY;
DOI : 10.2172/947229 RP-ID : LLNL-TR-410145 PID : OSTI ID: 947229 Others : TRN: US200906%%51
United States | English
Source: SciTech Connect