Near-duplicate documents can adversely affect the efficiency and effectiveness of search engines. Due to the pairwise nature of the comparisons required for near-duplicate detection, this process is extremely costly in terms of the time and processing power it requires. Despite the ubiquitous presence of near-duplicate detection algorithms in commercial search engines, their application and impact in research environments are not fully explored. The implementation of near-duplicate detection algorithms forces trade-offs between efficiency and effectiveness, entailing careful testing and measurement to ensure acceptable performance. In this thesis, we describe and evaluate a scalable implementation of a near-duplicate detection algorithm, based on standard shingling techniques, running under a MapReduce framework. We explore two different shingle sampling techniques and analyze their impact on the near-duplicate document detection process. In addition, we investigate the prevalence of near-duplicate documents in the runs submitted to the ad hoc task of the TREC 2009 Web Track.
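As a rough illustration of the shingling approach the abstract refers to, the sketch below builds word-level shingle sets, samples them by smallest hash value (one common sampling strategy; the thesis itself compares two specific schemes not detailed here), and scores pairs by Jaccard resemblance. Function names, the shingle width `k`, and the sample size are illustrative choices, not the thesis's actual parameters.

```python
import hashlib

def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word n-grams) of text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def min_hash_sample(shingle_set, sample_size=8):
    """Sample a shingle set by keeping the sample_size smallest hash values.
    This is one typical sampling scheme; the thesis evaluates two variants."""
    hashed = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingle_set)
    return hashed[:sample_size]

def resemblance(a, b):
    """Jaccard similarity between two shingle sets; 1.0 means identical."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Because each document reduces to a small fixed-size sample, the expensive pairwise comparison step operates on samples rather than full shingle sets, which is what makes a MapReduce-style scale-out practical.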
The Impact of Near-Duplicate Documents on Information Retrieval Evaluation