Technical Report Details
Feasibility of N-Gram Data-Structures for Next-Generation Pathogen Signature Design
Gardner, S N
Keywords: DESIGN; DETECTION; FORECASTING; LAWRENCE LIVERMORE NATIONAL LABORATORY; PATHOGENS; STORAGE; VELOCITY
DOI: 10.2172/947229 | RP-ID: LLNL-TR-410145 | PID: OSTI ID: 947229 | Others: TRN: US200906%%51
United States | English
Source: SciTech Connect
【 Abstract 】
We determined the most appropriate data structure for handling n-gram (also known as k-mer) string comparisons and storage for genomic sequence data that will scale in terms of memory and speed. This is critical to maintain LLNL as the leader in pathogen detection, as it will guide the design of the 'Next Generation' system for computational signature prediction. We investigated two parts of k-mer analysis for signature prediction. The first is the enumeration and frequency counting of all observed k-mers in a sequence database (k-mer is a biological term equivalent to the CS term n-gram). The second is the down-selection and pairing of k-mers to generate a signature. We determined that for the first part, suffix arrays are the preferred method to enumerate k-mers, being memory efficient and relatively easy and fast to compute. For the second part, a subset of the k-mers can be stored and manipulated in a hash, with the subset chosen according to desired frequency characteristics such as most/least frequent from a set, shared among sequence sets, or discriminating across sequence sets.
【 Preview 】
Files | Size | Format | View |
---|---|---|---|
RO201705170003024LZ | 497KB | | |
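
The two-part approach summarized in the abstract can be illustrated with a small Python sketch. It is not taken from the report: the function names `build_suffix_array`, `count_kmers`, and `discriminating_kmers` are hypothetical, the suffix array is built by naive sorting (far slower than the construction algorithms a production system would use), and "discriminating" is assumed here to mean "present in every target sequence and absent from every background sequence".

```python
def build_suffix_array(text):
    """Return suffix start positions in lexicographic order of the suffixes.

    Naive construction by sorting; an assumption made for brevity only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_kmers(text, k):
    """Part 1: enumerate every k-mer in `text` with its frequency.

    Suffixes that share the same first k characters are adjacent in the
    suffix array, so one linear scan is enough to count each k-mer.
    """
    counts = []
    prev = None
    for start in build_suffix_array(text):
        kmer = text[start:start + k]
        if len(kmer) < k:
            continue  # suffix too short to contain a full k-mer
        if kmer == prev:
            counts[-1][1] += 1
        else:
            counts.append([kmer, 1])
            prev = kmer
    return [(kmer, n) for kmer, n in counts]


def discriminating_kmers(targets, background, k):
    """Part 2: keep a subset of k-mers in a hash (a Python dict).

    Assumed selection rule: the k-mer occurs in every target sequence
    and in no background sequence. Values are total target occurrences.
    """
    background_kmers = set()
    for seq in background:
        background_kmers.update(kmer for kmer, _ in count_kmers(seq, k))

    table = {}  # kmer -> [number of target sequences containing it, total count]
    for seq in targets:
        for kmer, n in count_kmers(seq, k):
            if kmer in background_kmers:
                continue
            entry = table.setdefault(kmer, [0, 0])
            entry[0] += 1
            entry[1] += n

    return {
        kmer: total
        for kmer, (n_seqs, total) in table.items()
        if n_seqs == len(targets)
    }


if __name__ == "__main__":
    print(count_kmers("ACGACG", 3))
    print(discriminating_kmers(["ACGTAC", "TACGTT"], ["GTTTAC"], 3))
```

Each sequence is indexed separately here so that no k-mer spans a sequence boundary. Other selection rules named in the abstract, such as most or least frequent k-mers, would only change the final filter over the dict.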