Thesis Details
High Performance Parallel and Distributed Genomic Sequence Search
Lin, Heshan ; Xiaosong Ma, Committee Chair ; Steffen Heber, Committee Member ; Frank Mueller, Committee Member ; Nagiza Samatova, Committee Member ; Douglas Reeves, Committee Member
University: North Carolina State University
Keywords: parallel I/O; scheduling; distributed computing; parallel bioinformatics; sequence database search
Others  :  https://repository.lib.ncsu.edu/bitstream/handle/1840.16/3481/etd.pdf?sequence=1&isAllowed=y
United States | English
【 Abstract 】

Genomic sequence database search identifies similarities between given query sequences and known sequences in a database. It forms a critical class of applications used widely and routinely in computational biology. Due to their wide application in diverse task settings, sequence search tools today are run on several types of parallel systems, including batch jobs on one or more supercomputers and interactive queries through web-based services. Despite successful parallelization of popular sequence search tools such as BLAST, in the past two decades the growth of sequence databases has outpaced that of computing hardware elements, making scalable and efficient parallel sequence search processing crucial in helping life scientists deal with the ever-increasing amount of genomic information. In this thesis, we investigate efficient and scalable parallel and distributed sequence-search solutions by addressing unique problems and challenges in the aforementioned execution settings. Specifically, this thesis research 1) introduces parallel I/O techniques into sequence-search tools and proposes novel computation and I/O co-scheduling algorithms that enable genomic sequence search to scale efficiently on massively parallel computers; 2) presents a semantic-based distributed I/O framework that leverages application-specific meta information to drastically reduce the amount of data transfer and thus enables distributed sequence-search collaboration on a global scale; 3) proposes a novel request scheduling technique for clustered sequence-search web servers that comprehensively takes into account both data locality and parallel search efficiency to optimize query response time under various server load levels and access scenarios. The efficacy of our proposed solutions has been verified on a broad range of parallel and distributed systems, including petascale supercomputers, the NSF TeraGrid system, and small- or medium-sized clusters.
In addition, our optimizations of massively parallel sequence search have been transformed into the official release of mpiBLAST-PIO, currently the only supported branch of mpiBLAST, a popular open-source sequence-search tool. mpiBLAST-PIO is able to achieve 93% parallel efficiency across 32,768 cores on the IBM Blue Gene/P supercomputer.

【 Preview 】
Attachments
Files Size Format View
High Performance Parallel and Distributed Genomic Sequence Search 953KB PDF download