Journal Article Details
Journal of computational biology: A journal of computational molecular cell biology
Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees
Brad Solomon
Keywords: data indexing; RNA-seq; sequence bloom trees; sequence search
DOI: 10.1089/cmb.2017.0265
Subject classification: Biological Sciences (General)
Source: Mary Ann Liebert, Inc. Publishers
【Abstract】

Enormous databases of short-read RNA-seq experiments such as the NIH Sequencing Read Archive are now available. These databases could answer many questions about condition-specific expression or population variation, and this resource is only going to grow over time. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. Although some progress has been made on this problem, it is still not feasible to search collections of hundreds of terabytes of short-read sequencing experiments. We introduce an indexing scheme called split sequence bloom trees (SSBTs) to support sequence-based querying of terabyte scale collections of thousands of short-read sequencing experiments. SSBT is an improvement over the sequence bloom tree (SBT) data structure for the same task. We apply SSBTs to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments for the breast, blood, and brain tissues. We demonstrate that this SSBT index can be queried for a 1000 nt sequence in <4 minutes using a single thread and can be stored in just 39 GB, a fivefold improvement in search and storage costs compared with SBT.
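
The abstract does not spell out how the index is queried, so the following is only a minimal sketch of the baseline SBT idea that SSBT improves on, not the authors' implementation. It assumes the usual SBT formulation: each leaf holds a Bloom filter of the k-mers in one experiment, each internal node holds the union of its children's filters, and a query descends only into subtrees whose filter contains at least a fraction `theta` of the query's k-mers. All names here (`BloomFilter`, `Node`, `leaf`, `parent`, `query`), the filter size, the hash scheme, and the value of `k` are illustrative assumptions. The split into separate filters per node that gives SSBT its name is not modeled.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


def kmers(seq: str, k: int) -> List[str]:
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class BloomFilter:
    """Toy Bloom filter backed by a Python int (sizes are assumptions)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


@dataclass
class Node:
    bloom: BloomFilter
    name: Optional[str] = None          # set on leaves (one experiment each)
    children: List["Node"] = field(default_factory=list)


def leaf(name: str, reads: List[str], k: int) -> Node:
    """Build a leaf whose filter holds all k-mers of one experiment."""
    bloom = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bloom.add(kmer)
    return Node(bloom, name)


def parent(left: Node, right: Node) -> Node:
    """Build an internal node whose filter is the union of its children."""
    return Node(left.bloom.union(right.bloom), children=[left, right])


def query(node: Node, query_kmers: List[str], theta: float) -> List[str]:
    """Return names of leaves containing at least theta of the query k-mers."""
    if not query_kmers:
        return []
    hits = sum(1 for kmer in query_kmers if kmer in node.bloom)
    if hits < theta * len(query_kmers):
        return []                       # prune this whole subtree
    if not node.children:
        return [node.name]
    found: List[str] = []
    for child in node.children:
        found.extend(query(child, query_kmers, theta))
    return found


if __name__ == "__main__":
    k = 5
    root = parent(leaf("exp_a", ["ACGTACGTTGCA"], k),
                  leaf("exp_b", ["GGGGCCCCAAAATT"], k))
    print(query(root, kmers("ACGTACGT", k), theta=1.0))
```

Because a Bloom filter never yields false negatives, any experiment that truly contains the required fraction of query k-mers is always reported. False positives remain possible and depend on the filter size and the number of hash functions.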

【License】

Unknown   
