Technical Report Details
Challenges in Microbial Database Interoperability Interagency Microbe Project Working Group
Critchlow, Terence
Lawrence Livermore National Laboratory
Keywords: Data Analysis; Documentation; Communities; 99 General And Miscellaneous//Mathematics, Computing, And Information Science
DOI  :  10.2172/15005933
RP-ID  :  UCRL-ID-146327
RP-ID  :  W-7405-ENG-48
RP-ID  :  15005933
United States | English
Source: UNT Digital Library
【 Abstract 】

Currently, data of interest to microbial researchers is spread across hundreds of web-accessible data sources, each with a unique interface and data format. Researchers interact with a few of these sites when they analyze their data, but are not able to use the majority of them on a regular basis. Two significant challenges must be overcome to integrate this environment and allow researchers to efficiently perform data analysis across the entire set of relevant data, or at least a significant portion of it. The first is to provide consistent access to the large number of heterogeneous data sets that are currently distributed over the web. The second is to define the semantics of the data provided by the individual sites in such a way that semantic conflicts can be identified and, ideally, resolved.

The first step in establishing any integrated environment, from a data warehouse to a multi-database system, is to provide consistent access to all of the relevant sources. While the type of access required will vary with the integration strategy chosen (for example, federated systems use query-based access, while warehouses may prefer access to the underlying database), the essence of this challenge remains the same. Thus, without loss of generality, the remainder of this discussion focuses on query-based access. Each data source independently determines the queries that it supports, how it will answer them, and the interface through which they are submitted. Even when the same query capability is provided by different sources, the details of the interface usually differ. For example, while many sequence data sources support BLAST searches, they differ in parameter names, available options, script locations, etc.
These differences are not restricted to input parameters; the query results returned by different sources also vary dramatically, with some sources returning XML, others preformatted text, and still others a variety of formats. This set of disparate interfaces makes developing an integrated environment extremely challenging because a specialized wrapper needs to be created for each data source.

Once consistent data access has been provided, the next challenge is to provide a semantically and syntactically consistent environment for scientists to use. This would allow them to transfer data smoothly between different query interfaces and applications. Unfortunately, this is an even more daunting task than providing data access, because resolving semantic differences between data sources first requires understanding the semantics each source uses. Currently, a source's semantic description of its data is usually buried in its documentation, if it is provided at all. As a result, scientists have become adept at simply looking at the data being provided and divining a first-order approximation of the semantics used by the source. Often, this approximation is sufficient for the types of queries being asked. However, when precise semantics are needed, a tedious and time-consuming search must be undertaken. Fortunately, some communities are becoming aware of this problem and are developing ontologies that overcome it by precisely defining the semantics of commonly used terms. While this simplifies data integration for those sources that adhere to a specific ontology, the definition of a single ontology for the entire domain of genomics remains a (probably unachievable) dream. Resolving syntactic differences is relatively straightforward once the semantic ones have been resolved.
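
The per-source wrapper mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the report: the source names, parameter names (`seq`, `expect`, `QUERY`, `E_THRESH`), and both result layouts are hypothetical, chosen only to show one canonical query being translated to source-specific parameters and two result formats being normalized to one record shape.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict, List

# Canonical record shape for the integrated environment (an assumption).
Hit = Dict[str, str]


def parse_xml_hits(payload: str) -> List[Hit]:
    """Parse a made-up XML layout: <hits><hit id="..." score="..."/></hits>."""
    root = ET.fromstring(payload)
    return [{"id": h.attrib["id"], "score": h.attrib["score"]}
            for h in root.iter("hit")]


def parse_text_hits(payload: str) -> List[Hit]:
    """Parse a made-up text layout: one 'id score' pair per line."""
    hits = []
    for line in payload.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            hits.append({"id": parts[0], "score": parts[1]})
    return hits


@dataclass
class SourceWrapper:
    """Hides one source's parameter names and result format."""
    name: str
    param_names: Dict[str, str]          # canonical name -> source name
    parse: Callable[[str], List[Hit]]    # raw response -> canonical hits

    def build_query(self, **canonical) -> Dict[str, str]:
        """Translate canonical parameters into this source's names."""
        unknown = set(canonical) - set(self.param_names)
        if unknown:
            raise ValueError(f"{self.name} does not support: {sorted(unknown)}")
        return {self.param_names[k]: str(v) for k, v in canonical.items()}


# Two hypothetical sequence-search sources with different interfaces.
SOURCE_A = SourceWrapper(
    "source_a", {"sequence": "seq", "max_evalue": "expect"}, parse_xml_hits)
SOURCE_B = SourceWrapper(
    "source_b", {"sequence": "QUERY", "max_evalue": "E_THRESH"}, parse_text_hits)
```

A caller would build the same canonical query for every source, for example `SOURCE_A.build_query(sequence="ACGT", max_evalue=0.001)`, and pass each raw response through that source's `parse` function. The sketch addresses only syntactic access; it does nothing to reconcile what a field such as `score` means at each source, which is the semantic problem the abstract describes as the harder one.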

【 Preview 】
Attachments
Files Size Format View
15005933.pdf 141KB PDF download
Document metrics
Downloads: 11    Views: 53