Technical Report Details
Final Report PIPELINING RDP DATA TO THE "TAXOMATIC"
Garrity, George
Keywords: Self-Organizing Self-Correcting Classification (SOSCC) algorithms; identification of annotation errors; detection of unresolved synonymies; taxonomic and nomenclatural errors; accelerate the production and distribution of the updated versions of the prokaryotic taxonomy in lock-step with publication of new taxa and rearrangement of existing taxa; interactive web application; phylogenetic and genomic information; access sequence data; scientific literature
DOI  :  10.2172/1053503
RP-ID  :  DOE -MICHIGAN STATE- 63933
PID  :  OSTI ID: 1053503
United States | English
Source: SciTech Connect
【 Abstract 】

This project builds on the results of previously funded research by integrating data and software that had been used to build resources for the preparation of Bergey's Manual of Systematic Bacteriology, 2nd Edition (Volumes 1 & 2A-C) and the Ribosomal Database Project-II (RDP-II), so as to both enhance the value of the data and create a pipeline approach to keeping the data current. Earlier, we demonstrated the value of using exploratory data analysis (EDA) to visualize large sets of sequence data (notably SSU rRNA gene sequences) used in constructing a comprehensive phylogeny of prokaryotes. While the Self-Organizing Self-Correcting Classification (SOSCC) algorithms we developed were computationally efficient and useful for unraveling problems within the underlying data (e.g., identification of annotation errors, detection of unresolved synonymies, taxonomic and nomenclatural errors), bottlenecks at the preprocessing stage limited deployment of our applications as tools for end-users. To overcome the bottlenecks (which included hand alignment and computation of large matrices of pair-wise evolutionary distances), we proposed building a data pipeline between the "Taxomatic" application and RDP-II. The objectives were to accelerate the production and distribution of the updated versions of the prokaryotic taxonomy in lock-step with publication of new taxa and rearrangement of existing taxa, and to distribute these data more readily to the RDP-II and other stakeholders in the community. A related goal of the current project is to deploy our visualization techniques as an interactive web application by which end-users can view, manipulate, and select datasets of particular interest based upon phylogenetic and genomic information, access sequence data, and ultimately reach the scientific literature where the original observations were made, as well as the literature that builds on those observations.
The Taxomatic is a web-based tool for visualizing distance matrices. The tool accepts raw distance matrices or aligned sequence information as data sources. When sequence information is provided, the distance matrix is computed using the uncorrected distance model. Users can upload files to the Taxomatic website, or sequences can be submitted through a SOAP service. This SOAP service is used by RDP to streamline Taxomatic use with RDP data. In addition to supplying source information, users can supply their own taxonomic information by uploading it in XML; retrieve taxonomic information from the RDP using either RDP or GenBank identifiers as source data, with or without classification by the RDP Classifier web service; or omit taxonomic data entirely. In the last case, the input distance matrix can be viewed in the order in which it was loaded.
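
The "uncorrected distance model" mentioned above is commonly understood as the p-distance: the fraction of compared alignment columns at which two sequences differ, with no correction for multiple substitutions. The following is a minimal sketch of that calculation, not the Taxomatic's actual code. The function names `p_distance` and `distance_matrix` are hypothetical, and skipping columns that contain a gap in either sequence is an assumption, since the report does not say how gaps are treated.

```python
from __future__ import annotations

from itertools import combinations

# Assumption: '-' and '.' both mark alignment gaps and are excluded from comparison.
GAP_CHARS = set("-.")


def p_distance(a: str, b: str) -> float:
    """Uncorrected (p-) distance between two aligned sequences.

    Returns the fraction of compared columns where the residues differ.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # assumption: ignore columns with a gap in either sequence
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable columns between the two sequences")
    return mismatches / compared


def distance_matrix(seqs: dict[str, str]) -> tuple[list[str], list[list[float]]]:
    """Build a symmetric pair-wise distance matrix, keeping the input order."""
    names = list(seqs)
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = p_distance(seqs[names[i]], seqs[names[j]])
        matrix[i][j] = d
        matrix[j][i] = d
    return names, matrix


if __name__ == "__main__":
    # Toy alignment, invented purely for illustration.
    names, matrix = distance_matrix({"a": "ACGT", "b": "ACGA", "c": "TCGA"})
    for name, row in zip(names, matrix):
        print(name, row)
```

Because the rows and columns follow the order in which the sequences were supplied, the result corresponds to the case described above where no taxonomic data are given and the matrix is viewed in load order. The pair-wise loop is quadratic in the number of sequences, which is consistent with the abstract's remark that computing large distance matrices was a preprocessing bottleneck.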
