Near-duplicate documents can adversely affect the efficiency and effectiveness of search engines. Due to the pairwise nature of the comparisons required for near-duplicate detection, this process is extremely costly in terms of the time and processing power it requires. Despite the ubiquitous presence of near-duplicate detection algorithms in commercial search engines, their application and impact in research environments are not fully explored. The implementation of near-duplicate detection algorithms forces trade-offs between efficiency and effectiveness, entailing careful testing and measurement to ensure acceptable performance. In this thesis, we describe and evaluate a scalable implementation of a near-duplicate detection algorithm, based on standard shingling techniques, running under a MapReduce framework. We explore two different shingle sampling techniques and analyze their impact on the near-duplicate document detection process. In addition, we investigate the prevalence of near-duplicate documents in the runs submitted to the ad hoc task of the TREC 2009 Web Track.
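As a rough illustration of the shingling approach the abstract refers to, the sketch below builds word-level shingle sets, samples them by smallest hash value (one common sampling strategy; the thesis itself compares two specific schemes not detailed here), and scores pairs by Jaccard resemblance. Function names, the shingle width `k`, and the sample size are illustrative choices, not the thesis's actual parameters.

```python
import hashlib

def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word n-grams) of text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def min_hash_sample(shingle_set, sample_size=8):
    """Sample a shingle set by keeping the sample_size smallest hash values.
    This is one typical sampling scheme; the thesis evaluates two variants."""
    hashed = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingle_set)
    return hashed[:sample_size]

def resemblance(a, b):
    """Jaccard similarity between two shingle sets; 1.0 means identical."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Because each document reduces to a small fixed-size sample, the expensive pairwise comparison step operates on samples rather than full shingle sets, which is what makes a MapReduce-style scale-out practical.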
The Impact of Near-Duplicate Documents on Information Retrieval Evaluation