Thesis Details
The Impact of Near-Duplicate Documents on Information Retrieval Evaluation
Khoshdel Nikkhoo, Hani
University of Waterloo
Keywords: near-duplicate detection; MapReduce; shingles; Computer Science
Others  :  https://uwspace.uwaterloo.ca/bitstream/10012/5750/1/Khoshdel%20Nikkhoo_Hani.pdf
Switzerland | English
Source: UWSPACE Waterloo Institutional Repository
【 Abstract 】

Near-duplicate documents can adversely affect the efficiency and effectiveness of search engines. Due to the pairwise nature of the comparisons required for near-duplicate detection, this process is extremely costly in terms of the time and processing power it requires. Despite the ubiquitous presence of near-duplicate detection algorithms in commercial search engines, their application and impact in research environments are not fully explored. The implementation of near-duplicate detection algorithms forces trade-offs between efficiency and effectiveness, entailing careful testing and measurement to ensure acceptable performance. In this thesis, we describe and evaluate a scalable implementation of a near-duplicate detection algorithm, based on standard shingling techniques, running under a MapReduce framework. We explore two different shingle sampling techniques and analyze their impact on the near-duplicate document detection process. In addition, we investigate the prevalence of near-duplicate documents in the runs submitted to the ad hoc task of the TREC 2009 Web Track.
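
To make the shingling idea in the abstract concrete, the following is a minimal single-machine sketch in Python, not the thesis implementation. It assumes word-level shingles of width w = 4 and Jaccard resemblance between shingle sets. The abstract does not name the two sampling techniques it studies, so the two samplers shown here (keep the s smallest hashes, or keep hashes divisible by m) are common options from the shingling literature and are included only as an assumption. All function names, the choice of SHA-1, and the default parameters are hypothetical.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) of text."""
    words = text.lower().split()
    if len(words) < w:
        # Documents shorter than one window yield a single short shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def shingle_hash(shingle):
    """Map a shingle to a stable 64-bit integer (illustrative hash choice)."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sample_min_s(hashes, s=8):
    """Assumed sampler 1: keep the s smallest hash values."""
    return set(sorted(hashes)[:s])


def sample_mod_m(hashes, m=4):
    """Assumed sampler 2: keep hash values divisible by m."""
    return {h for h in hashes if h % m == 0}


def resemblance(a, b):
    """Jaccard resemblance |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"

set_a, set_b = shingles(doc_a), shingles(doc_b)
full_score = resemblance(set_a, set_b)  # 5 shared shingles out of 7 distinct

hashes_a = {shingle_hash(s) for s in set_a}
hashes_b = {shingle_hash(s) for s in set_b}
sampled_score = resemblance(sample_min_s(hashes_a, 4), sample_min_s(hashes_b, 4))
```

Comparing every pair of documents this way is the quadratic cost the abstract refers to. Sampling reduces the per-document sketch size at the price of a noisier resemblance estimate, which is one form of the efficiency versus effectiveness trade-off described above.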
