Journal Article Details
PeerJ
ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses
article
Natasha Pavlovikj [1], Joao Carlos Gomes-Neto [2], Jitender S. Deogun [1], Andrew K. Benson [2]
[1] Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln; [2] Department of Food Science and Technology, University of Nebraska-Lincoln, Lincoln; [3] Nebraska Food for Health Center, University of Nebraska-Lincoln, Lincoln
Keywords: Bacteria; Population-genomics; Pan-genome; High-performance computing; High-throughput computing; Scalability; Workflow-management system; Pipeline
DOI: 10.7717/peerj.11376
Subject classification: Social Sciences, Humanities and Arts (General)
Source: Inra
【 Abstract 】

Whole Genome Sequence (WGS) data from bacterial species are used for a variety of applications ranging from basic microbiological research to diagnostics and epidemiological surveillance. The availability of WGS data from hundreds of thousands of individual isolates of individual microbial species presents a tremendous opportunity for discovery and hypothesis-generating research into the ecology and evolution of these microorganisms. However, the limited flexibility, scalability, and user-friendliness of existing pipelines for population-scale inquiry constrain applications of systematic, population-scale approaches. Here, we present ProkEvo, an automated, scalable, reproducible, and open-source framework for bacterial population genomics analyses using WGS data. ProkEvo was specifically developed to achieve the following goals: (1) Automation and scaling of complex combinations of computational analyses for many thousands of bacterial genomes from inputs of raw Illumina paired-end sequence reads; (2) Use of workflow management systems (WMS) such as Pegasus WMS to ensure reproducibility, scalability, modularity, fault-tolerance, and robust file management throughout the process; (3) Use of high-performance and high-throughput computational platforms; (4) Generation of hierarchical population structure analyses based on combinations of multi-locus and Bayesian statistical approaches to classification for ecological and epidemiological inquiries; (5) Association of antimicrobial resistance (AMR) genes, putative virulence factors, and plasmids from curated databases with the hierarchically related genotypic classifications; and (6) Production of pan-genome annotations and data compilation that can be utilized for downstream analyses such as identification of population-specific genomic signatures. The scalability of ProkEvo was measured with two datasets comprising significantly different numbers of input genomes (one with ~2,400 genomes, and the second with ~23,000 genomes). Depending on the dataset and the computational platform used, the running time of ProkEvo varied from ~3 to ~26 days. ProkEvo can be used with virtually any bacterial species, and the Pegasus WMS uniquely facilitates addition or removal of programs from the workflow or modification of options within them. To demonstrate the versatility of the ProkEvo platform, we performed hierarchical population structure analyses of available genomes from three distinct pathogenic bacterial species as individual case studies. The specific case studies illustrate how hierarchical analyses of population structures, genotype frequencies, and distribution of specific gene functions can be integrated into an analysis. Collectively, our study shows that ProkEvo presents a practical, viable option for scalable, automated analyses of bacterial populations with direct applications for basic microbiology research, clinical microbiological diagnostics, and epidemiological surveillance.

【 License 】

CC BY   
