Conference Paper Details
SEPLN'09 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse
Title  :  ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection
Authors  :  Cristian Grozea (Fraunhofer FIRST IDA Group) ; Christian Gehl ; Marius Popescu
Others  :  http://CEUR-WS.org/Vol-502/paper2.pdf
PID  :  2042
Source: CEUR
【 Abstract 】

In this paper we describe a new general plagiarism detection method that we used in our winning entry to the 1st International Competition on Plagiarism Detection, in the external plagiarism detection task, which assumes that the source documents are available. In the first phase of our method, a matrix of kernel values is computed, which gives an n-gram-based similarity value between each source document and each suspicious document. In the second phase, each promising pair is investigated further in order to extract the precise positions and lengths of the subtexts that have been copied and possibly obfuscated, using encoplot, a novel linear-time pairwise sequence matching technique. We solved the significant computational challenges arising from having to compare millions of document pairs by using a library developed by our group mainly for use in network security tools. The performance achieved is the comparison of more than 49 million pairs of documents in 12 hours on a single computer. The results in the challenge were very good: we outperformed all other methods.
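
The first phase described in the abstract (an n-gram-based similarity value for every source/suspicious pair) can be illustrated with a minimal Python sketch. This is not the authors' implementation or library. The abstract does not specify the kernel, the n-gram type, or the value of n, so the choice of character n-grams, the cosine similarity over binary n-gram vectors, the default n = 4, and the function names char_ngrams, ngram_similarity and similarity_matrix are all assumptions made for illustration only.

```python
from math import sqrt


def char_ngrams(text: str, n: int) -> set[str]:
    """Return the set of distinct character n-grams in text (assumed representation)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a: set[str], b: set[str]) -> float:
    """Cosine similarity of two binary n-gram vectors (an assumed kernel, not the paper's)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def similarity_matrix(sources: list[str], suspicious: list[str], n: int = 4) -> list[list[float]]:
    """Row i, column j holds the similarity of sources[i] and suspicious[j]."""
    src = [char_ngrams(s, n) for s in sources]
    sus = [char_ngrams(s, n) for s in suspicious]
    return [[ngram_similarity(a, b) for b in sus] for a in src]
```

Under these assumptions, pairs whose matrix entry is high would be the "promising pairs" passed on to the second, more detailed matching phase. How the threshold or ranking is chosen is not stated in the abstract.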
