学位论文详细信息
Parallel and scalable combinatorial string algorithms on distributed memory systems
Distributed memory;Suffix array;Suffix tree;Enhanced suffix arrays;De Bruijn graphs;Distributed graphs;Connected components
Flick, Patrick ; Aluru, Srinivas Computational Science and Engineering Catalyurek, Umit V. Vuduc, Richard W. Yalamanchili, Sudhakar Ucar, Bora ; Aluru, Srinivas
University:Georgia Institute of Technology
Department:Computational Science and Engineering
关键词: Distributed memory;    Suffix array;    Suffix tree;    Enhanced suffix arrays;    De Bruijn graphs;    Distributed graphs;    Connected components;   
Others  :  https://smartech.gatech.edu/bitstream/1853/61257/1/FLICK-DISSERTATION-2019.pdf
美国|英语
来源: SMARTech Repository
PDF
【 摘 要 】

Methods for processing and analyzing DNA and genomic data are built upon combinatorial graph and string algorithms. The advent of high-throughput DNA sequencing is enabling the generation of billions of reads per experiment. Classical and sequential algorithms can no longer deal with these growing data sizes - which for the last 10 years have greatly out-paced advances in processor speeds. Processing and analyzing state-of-the-art genomic data sets require the design of scalable and efficient parallel algorithms and the use of large computing clusters. Suffix arrays and trees are fundamental string data structures, which lie at the foundation of many string algorithms, with important applications in text processing, information retrieval, and computational biology. Conversely, the parallel construction of these indices is an actively studied problem. However, prior approaches lacked good worst-case run-time guarantees and exhibit poor scaling and overall performance. In this work, we present our distributed-memory parallel algorithms for indexing large datasets, including algorithms for the distributed construction of suffix arrays, LCP arrays, and suffix trees. We formulate a generalized version of the All-Nearest-Smaller-Values problem, provide an optimal distributed solution, and apply it to the distributed construction of suffix trees - yielding a work-optimal parallel algorithm. Our algorithms for distributed suffix array and suffix tree construction improve the state-of-the-art by simultaneously improving worst-case run-time bounds and achieving superior practical performance. Next, we introduce a novel distributed string index, the Distributed Enhanced Suffix Array (DESA) - based on the suffix and LCP arrays, the DESA consists of these and additional distributed data structures. The DESA is designed to allow efficient pattern search queries in distributed memory while requiring at most O(n/p) memory per process. We present efficient distributed-memory parallel algorithms for querying, as well as for the efficient construction of this distributed index. Finally, we present our work on distributed-memory algorithms for clustering de Bruijn graphs and its application to solving a grand challenge metagenomic dataset.

【 预 览 】
附件列表
Files Size Format View
Parallel and scalable combinatorial string algorithms on distributed memory systems 1153KB PDF download
  文献评价指标  
  下载次数:24次 浏览次数:26次