Automated Data Cleansing in Data Harvesting and Data Migration
Martin, Mark; Vowell, Lance; King, Ian; Augustus, Chris (ORCID: 0000000172972325)
Keywords: Duplicate Removal; unstructured data; structured data; Secure Hashing Algorithm; SHA-1; Latent Semantic Indexing; LSI; information technology; knowledge management; bibliographic data
DOI: 10.2172/949761; RP-ID: 07ER84709 Final Report; PID: OSTI ID: 949761
Subject classification: Mathematics (general)
United States | English
Source: SciTech Connect
【 Abstract 】
In the proposal for this project, we noted how the explosion of digitized information available through corporate databases, data stores and online search systems has resulted in the knowledge worker being bombarded by information. Knowledge workers typically spend more than 20-30% of their time seeking and sorting information, finding the information only 50-60% of the time. This information exists as unstructured, semi-structured and structured data. The problem of information overload is compounded by the production of duplicate or near-duplicate information. In addition, near-duplicate items frequently have different origins, creating a situation in which each item may have unique information of value, but their differences are not significant enough to justify maintaining them as separate entities. Effective tools can be provided to eliminate duplicate and near-duplicate information. The proposed approach was to extract unique information from data sets and consolidate that information into a single comprehensive file.