Technical Report

[Abstract]

In the proposal for this project, we noted how the explosion of digitized information available through corporate databases, data stores and online search systems has left knowledge workers bombarded by information. Knowledge workers typically spend upward of 20-30% of their time seeking and sorting information, and find what they are looking for only 50-60% of the time. This information exists as unstructured, semi-structured and structured data. The problem of information overload is compounded by the production of duplicate or near-duplicate information. In addition, near-duplicate items frequently have different origins, so each item may contain unique information of value, yet the differences between items are not significant enough to justify maintaining them as separate entities. Effective tools can be provided to eliminate duplicate and near-duplicate information. The proposed approach was to extract unique information from data sets and consolidate that information into a single comprehensive file.
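
As an illustration only, the exact-duplicate part of such an approach can be sketched with a SHA-1 fingerprint, which the report's keywords mention. The abstract does not describe the project's implementation, so everything here is an assumption: the function names (normalize, sha1_fingerprint, consolidate) are hypothetical, and the whitespace and case normalization is one simple choice among many.

```python
import hashlib


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies hash alike.

    This normalization is an assumption made for the sketch, not the report's rule.
    """
    return " ".join(text.split()).lower()


def sha1_fingerprint(text: str) -> str:
    """Return the SHA-1 hex digest of the normalized text."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def consolidate(records):
    """Keep the first record seen for each distinct fingerprint, preserving order."""
    seen = set()
    unique = []
    for record in records:
        digest = sha1_fingerprint(record)
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```

A hash comparison of this kind only catches copies that are identical after normalization. Near-duplicates with differing content would need a similarity measure instead, such as the Latent Semantic Indexing named in the keywords, which this sketch does not attempt.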



Automated Data Cleansing in Data Harvesting and Data Migration

Martin, Mark ; Vowell, Lance ; King, Ian ; Augustus, Chris (ORCID:0000000172972325)
Keywords: Duplicate Removal; unstructured data; structured data; Secure Hashing Algorithm; SHA-1; Latent Semantic Indexing; LSI; information technology; knowledge management; bibliographic data
DOI : 10.2172/949761 RP-ID : 07ER84709 Final Report PID : OSTI ID: 949761
Subject classification: Mathematics (general)
United States | English
Source: SciTech Connect