Technical Report Details
Feasibility of N-Gram Data-Structures for Next-Generation Pathogen Signature Design
Gardner, S N
Keywords: DESIGN; DETECTION; FORECASTING; LAWRENCE LIVERMORE NATIONAL LABORATORY; PATHOGENS; STORAGE; VELOCITY
DOI  :  10.2172/947229
RP-ID  :  LLNL-TR-410145
PID  :  OSTI ID: 947229
Others  :  TRN: US200906%%51
United States | English
Source: SciTech Connect
【 Abstract 】
We determined the most appropriate data structure for handling n-gram (also known as k-mer) string comparisons and storage for genomic sequence data, one that will scale in both memory and speed. This is critical to maintaining LLNL as the leader in pathogen detection, as it will guide the design of the 'Next Generation' system for computational signature prediction. We investigated two parts of k-mer analysis for signature prediction (k-mer is the biological term equivalent to the computer science term n-gram). The first is the enumeration and frequency counting of all observed k-mers in a sequence database. The second is the down-selection and pairing of k-mers to generate a signature. For the first part, we determined that suffix arrays are the preferred method for enumerating k-mers, being memory efficient and relatively easy and fast to compute. For the second part, a subset of the k-mers can be stored and manipulated in a hash, with that subset chosen according to desired frequency characteristics such as most or least frequent in a set, shared among sequence sets, or discriminating across sequence sets.
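The two-part approach described in the abstract can be illustrated with a minimal Python sketch. This is not the report's implementation: the function names (`build_suffix_array`, `enumerate_kmers`, `discriminating_kmers`) are hypothetical, the suffix array is built by naive sorting rather than by a memory-efficient construction algorithm, and "discriminating" is assumed here to mean "present in every target sequence and absent from every background sequence".

```python
def build_suffix_array(text):
    """Return suffix start positions in lexicographic order of the suffixes.

    Naive construction for illustration only; a production system would
    likely use a dedicated linear-time or near-linear-time algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def enumerate_kmers(text, k):
    """Part 1: list (k-mer, count) pairs in sorted order.

    In a suffix array, all suffixes sharing the same first k characters
    sit next to each other, so counting reduces to measuring run lengths.
    """
    result = []
    prev, count = None, 0
    for start in build_suffix_array(text):
        if start + k > len(text):
            continue  # this suffix is shorter than k, so it holds no k-mer
        kmer = text[start:start + k]
        if kmer == prev:
            count += 1
        else:
            if prev is not None:
                result.append((prev, count))
            prev, count = kmer, 1
    if prev is not None:
        result.append((prev, count))
    return result


def discriminating_kmers(target_seqs, background_seqs, k):
    """Part 2: keep a down-selected subset of k-mers in a hash (dict).

    Assumed selection rule: shared by all target sequences, absent from
    all background sequences. Values are total counts across targets.
    """
    per_target = [dict(enumerate_kmers(seq, k)) for seq in target_seqs]
    if not per_target:
        return {}
    shared = set(per_target[0])
    for counts in per_target[1:]:
        shared &= counts.keys()
    background = set()
    for seq in background_seqs:
        background.update(kmer for kmer, _ in enumerate_kmers(seq, k))
    return {
        kmer: sum(counts[kmer] for counts in per_target)
        for kmer in shared - background
    }


if __name__ == "__main__":
    print(enumerate_kmers("ACGTACGT", 4))
    print(discriminating_kmers(["ACGTAC", "TACGTA"], ["CGTAAA"], 4))
```

Other selection rules mentioned in the abstract, such as most or least frequent k-mers in a set, could be applied to the same dictionaries by sorting on the count values.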