Technical Report Details
Automated Data Cleansing in Data Harvesting and Data Migration
Martin, Mark ; Vowell, Lance ; King, Ian ; Augustus, Chris (ORCID:0000000172972325)
Keywords: Duplicate Removal; unstructured data; structured data; Secure Hashing Algorithm; SHA-1; Latent Semantic Indexing; LSI; information technology; knowledge management; bibliographic data
DOI: 10.2172/949761
RP-ID: 07ER84709 Final Report
PID: OSTI ID: 949761
Subject classification: Mathematics (general)
United States | English
Source: SciTech Connect
【 Abstract 】

In the proposal for this project, we noted how the explosion of digitized information available through corporate databases, data stores, and online search systems has left the knowledge worker bombarded by information. Knowledge workers typically spend 20-30% or more of their time seeking and sorting information, and find what they are looking for only 50-60% of the time. This information exists as unstructured, semi-structured, and structured data. The problem of information overload is compounded by the production of duplicate or near-duplicate information. In addition, near-duplicate items frequently have different origins, so each item may contain unique information of value even though the differences between items are not significant enough to justify maintaining them as separate entities. Effective tools can be provided to eliminate duplicate and near-duplicate information. The proposed approach was to extract the unique information from data sets and consolidate that information into a single comprehensive file.
