BMC Genomics,2016年
Joshua McMichael, Christopher A. Miller, Richard K. Wilson, Ha X. Dang, Li Ding, Christopher A. Maher, Timothy J. Ley, Elaine R. Mardis
LicenseType:CC BY |
BackgroundMassively-parallel sequencing at depth is now enabling tumor heterogeneity and evolution to be characterized in unprecedented detail. Tracking these changes in clonal architecture often provides insight into therapeutic response and resistance. In complex cases involving multiple timepoints, standard visualizations, such as scatterplots, can be difficult to interpret. Current data visualization methods are also typically manual and laborious, and often only approximate subclonal fractions.ResultsWe have developed an R package that accurately and intuitively displays changes in clonal structure over time. It requires simple input data and produces illustrative and easy-to-interpret graphs suitable for diagnosis, presentation, and publication.ConclusionsThe simplicity, power, and flexibility of this tool make it valuable for visualizing tumor evolution, and it has potential utility in both research and clinical settings. The fishplot package is available at https://github.com/chrisamiller/fishplot.
2 RGAugury: a pipeline for genome-wide prediction of resistance gene analogs (RGAs) in plants [期刊论文]
BMC Genomics,2016年
Frank M. You, Pingchuan Li, Xiande Quan, Jin Xiao, Gaofeng Jia, Sylvie Cloutier
LicenseType:CC BY |
BackgroundResistance gene analogs (RGAs), such as NBS-encoding proteins, receptor-like protein kinases (RLKs) and receptor-like proteins (RLPs), are potential R-genes that contain specific conserved domains and motifs. Thus, RGAs can be predicted based on their conserved structural features using bioinformatics tools. Computer programs have been developed for the identification of individual domains and motifs from the protein sequences of RGAs but none offer a systematic assessment of the different types of RGAs. A user-friendly and efficient pipeline is needed for large-scale genome-wide RGA predictions of the growing number of sequenced plant genomes.ResultsAn integrative pipeline, named RGAugury, was developed to automate RGA prediction. The pipeline first identifies RGA-related protein domains and motifs, namely nucleotide binding site (NB-ARC), leucine rich repeat (LRR), transmembrane (TM), serine/threonine and tyrosine kinase (STTK), lysin motif (LysM), coiled-coil (CC) and Toll/Interleukin-1 receptor (TIR). RGA candidates are identified and classified into four major families based on the presence of combinations of these RGA domains and motifs: NBS-encoding, TM-CC, and membrane associated RLP and RLK. All time-consuming analyses of the pipeline are paralleled to improve performance. The pipeline was evaluated using the well-annotated Arabidopsis genome. A total of 98.5, 85.2, and 100 % of the reported NBS-encoding genes, membrane associated RLPs and RLKs were validated, respectively. The pipeline was also successfully applied to predict RGAs for 50 sequenced plant genomes. A user-friendly web interface was implemented to ease command line operations, facilitate visualization and simplify result management for multiple datasets.ConclusionsRGAugury is an efficiently integrative bioinformatics tool for large scale genome-wide identification of RGAs. It is freely available at Bitbucket: https://bitbucket.org/yaanlpc/rgaugury.
BMC Genomics,2016年
Kelly A. Brayton, Assefaw H. Gebremedhin, Helen N. Catanese
LicenseType:CC BY |
BackgroundShort-sequence repeats (SSRs) occur in both prokaryotic and eukaryotic DNA, inter- and intragenically, and may be exact or inexact copies. When heterogeneous SSRs are present in a given locus, we can take advantage of the pattern of different repeats to genotype strains based on the SSRs. Cataloguing and tracking these repeats can be difficult as diverse groups of researchers are involved in the identification of the repeats. Additionally, the task is error-prone when done manually.ResultsWe developed RepeatAnalyzer, a new software tool capable of tracking, managing, analysing and cataloguing SSRs and genotypes using Anaplasma marginale as a model species. RepeatAnalyzer’s analysis capability includes novel metrics for measuring regional genetic diversity (corresponding to variety and regularity of SSR occurrence). As a part of its visualization capabilities, RepeatAnalyzer produces high quality maps of the geographic distribution of genotypes or SSRs over a region of interest. RepeatAnalyzer’s repeat identification functionality was validated for all SSRs and genotypes reported in 21 publications, using 380 A. marginale isolates gathered from the five publications within that list that provided access to their isolates. The tool produced accurate genotyping results in every case. In addition, it uncovered a number of errors in the published literature: 11 cases where SSRs were misreported, 5 cases where two different SSRs had been given the same name, and 16 cases where two or more names had been given to a single SSR. The analysis and visualization functionalities of the tool are demonstrated using several examples.ConclusionsRepeatAnalyzer is a robust software tool that can be used for storing, managing, and analysing short-sequence repeats for the purpose of strain identification. The tool can be used for any set of SSRs regardless of species. When applied to A. marginale, our test case, we show that genotype lengths for a given region follow a normal distribution, while SSR frequencies follow a power-law-like distribution. Further, we find that over 90 % of repeats are 28 to 29 amino acids long, which is in agreement with conventional wisdom. Lastly, our analysis reveals that the most common edit distance is five or six, which is counter-intuitive since we expected that result to be closer to one, resulting from the simplest change from one repeat to another.
BMC Genomics,2016年
Elena Biagi, Matteo Soverini, Patrizia Brigidi, Marco Candela, Silvia Turroni, Simone Rampelli, Sara Quercia
LicenseType:CC BY |
BackgroundBioinformatics tools available for metagenomic sequencing analysis are principally devoted to the identification of microorganisms populating an ecological niche, but they usually do not consider viruses. Only some software have been designed to profile the viral sequences, however they are not efficient in the characterization of viruses in the context of complex communities, like the intestinal microbiota, containing bacteria, archeabacteria, eukaryotic microorganisms and viruses. In any case, a comprehensive description of the host-microbiota interactions can not ignore the profile of eukaryotic viruses within the virome, as viruses are definitely critical for the regulation of the host immunophenotype.ResultsViromeScan is an innovative metagenomic analysis tool that characterizes the taxonomy of the virome directly from raw data of next-generation sequencing. The tool uses hierarchical databases for eukaryotic viruses to unambiguously assign reads to viral species more accurately and >1000 fold faster than other existing approaches. We validated ViromeScan on synthetic microbial communities and applied it on metagenomic samples of the Human Microbiome Project, providing a sensitive eukaryotic virome profiling of different human body sites.ConclusionsViromeScan allows the user to explore and taxonomically characterize the virome from metagenomic reads, efficiently denoising samples from reads of other microorganisms. This implies that users can fully characterize the microbiome, including bacteria and viruses, by shotgun metagenomic sequencing followed by different bioinformatic pipelines.
BMC Genomics,2016年
R. Alexander Richter, Weizhong Li, Yunsup Jung, Qiyun Zhu, Robert W. Li
LicenseType:CC BY |
BackgroundRemarkable advances in Next Generation Sequencing (NGS) technologies, bioinformatics algorithms and computational technologies have significantly accelerated genomic research. However, complicated NGS data analysis still remains as a major bottleneck. RNA-seq, as one of the major area in the NGS field, also confronts great challenges in data analysis.ResultsTo address the challenges in RNA-seq data analysis, we developed a web portal that offers three integrated workflows that can perform end-to-end compute and analysis, including sequence quality control, read-mapping, transcriptome assembly, reconstruction and quantification, and differential analysis. The first workflow utilizes Tuxedo (Tophat, Cufflink, Cuffmerge and Cuffdiff suite of tools). The second workflow deploys Trinity for de novo assembly and uses RSEM for transcript quantification and EdgeR for differential analysis. The third combines STAR, RSEM, and EdgeR for data analysis. All these workflows support multiple samples and multiple groups of samples and perform differential analysis between groups in a single workflow job submission. The calculated results are available for download and post-analysis. The supported animal species include chicken, cow, duck, goat, pig, horse, rabbit, sheep, turkey, as well as several other model organisms including yeast, C. elegans, Drosophila, and human, with genomic sequences and annotations obtained from ENSEMBL.The RNA-seq portal is freely available from http://weizhongli-lab.org/RNA-seq.ConclusionsThe web portal offers not only bioinformatics software, workflows, computation and reference data, but also an integrated environment for complex RNA-seq data analysis for agricultural animal species. In this project, our aim is not to develop new RNA-seq tools, but to build web workflows for using popular existing RNA-seq methods and make these tools more accessible to the communities.
BMC Genomics,2016年
Joshua G. Dunn, Jonathan S. Weissman
LicenseType:CC BY |
BackgroundNext-generation sequencing (NGS) informs many biological questions with unprecedented depth and nucleotide resolution. These assays have created a need for analytical tools that enable users to manipulate data nucleotide-by-nucleotide robustly and easily. Furthermore, because many NGS assays encode information jointly within multiple properties of read alignments ― for example, in ribosome profiling, the locations of ribosomes are jointly encoded in alignment coordinates and length ― analytical tools are often required to extract the biological meaning from the alignments before analysis. Many assay-specific pipelines exist for this purpose, but there remains a need for user-friendly, generalized, nucleotide-resolution tools that are not limited to specific experimental regimes or analytical workflows.ResultsPlastid is a Python library designed specifically for nucleotide-resolution analysis of genomics and NGS data. As such, Plastid is designed to extract assay-specific information from read alignments while retaining generality and extensibility to novel NGS assays. Plastid represents NGS and other biological data as arrays of values associated with genomic or transcriptomic positions, and contains configurable tools to convert data from a variety of sources to such arrays.Plastid also includes numerous tools to manipulate even discontinuous genomic features, such as spliced transcripts, with nucleotide precision. Plastid automatically handles conversion between genomic and feature-centric coordinates, accounting for splicing and strand, freeing users of burdensome accounting. Finally, Plastid’s data models use consistent and familiar biological idioms, enabling even beginners to develop sophisticated analytical workflows with minimal effort.ConclusionsPlastid is a versatile toolkit that has been used to analyze data from multiple NGS assays, including RNA-seq, ribosome profiling, and DMS-seq. It forms the genomic engine of our ORF annotation tool, ORF-RATER, and is readily adapted to novel NGS assays. Examples, tutorials, and extensive documentation can be found at https://plastid.readthedocs.io.