Technical Report Details
Finding Similar Files in Large Document Repositories
Forman, George ; Eshghi, Kave ; Chiocchetti, Stephane
HP Development Company
Keywords: content management; document management; near duplicate detection; similarity; scalability
RP-ID: HPL-2005-42R1
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
【 Abstract 】

Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction. The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing. We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.

Notes: Copyright ACM. To be published in and presented at the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), 21-25 August 2005, Chicago, IL, USA. 7 pages.
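
The pipeline described in the abstract (chunk the byte stream, hash each chunk into a signature, then analyze the file-chunk graph for clusters) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the chunking rule, the hash functions, or the clustering criterion, so everything below is an assumption made for illustration: the CRC32-over-a-sliding-window boundary test, the parameters `window`, `mask`, `min_size` and `min_shared`, and the function names `chunk_bytes` and `similar_file_clusters` are all hypothetical. The optional bipartite graph partitioning step is not shown.

```python
import hashlib
import zlib
from collections import defaultdict
from itertools import combinations


def chunk_bytes(data: bytes, window: int = 16, mask: int = 0x3F,
                min_size: int = 32) -> list[bytes]:
    """Split data at content-defined boundaries.

    Assumed rule (not from the report): cut after position i when the
    CRC32 of the preceding `window` bytes has its low bits (`mask`) all
    zero and the current chunk is at least `min_size` bytes long.
    """
    chunks = []
    start = 0
    for i in range(window, len(data) + 1):
        if i - start < min_size:
            continue
        if zlib.crc32(data[i - window:i]) & mask == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def similar_file_clusters(files: dict[str, bytes],
                          min_shared: float = 0.5) -> list[set[str]]:
    """Group files that share a large fraction of chunk signatures."""
    # File side of the bipartite file-chunk graph: file -> chunk signatures.
    file_chunks = {
        name: {hashlib.sha1(c).hexdigest() for c in chunk_bytes(data)}
        for name, data in files.items()
    }
    # Chunk side of the graph: signature -> files containing it.
    chunk_files = defaultdict(set)
    for name, sigs in file_chunks.items():
        for sig in sigs:
            chunk_files[sig].add(name)

    # Union-find over files; link two files when the shared signatures
    # cover at least `min_shared` of the smaller file's signature set.
    parent = {name: name for name in files}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for names in chunk_files.values():
        for a, b in combinations(sorted(names), 2):
            shared = len(file_chunks[a] & file_chunks[b])
            smaller = min(len(file_chunks[a]), len(file_chunks[b]))
            if shared / smaller >= min_shared:
                parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for name in files:
        clusters[find(name)].add(name)
    return [members for members in clusters.values() if len(members) > 1]
```

Calling `similar_file_clusters({"v1.txt": old_bytes, "v2.txt": new_bytes, ...})` returns a list of sets of file names, one set per group of files judged similar. Because chunk boundaries depend on content rather than on fixed offsets, an edit in one region of a file tends to leave the chunks elsewhere unchanged, which is what lets related versions share signatures. The pairwise comparison inside each chunk's file set is quadratic and is kept only for brevity.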
