Jurnal Terapan Teknologi Informasi: JUTEI
Penerapan Simhash dan Hamming Distance dalam Deteksi Kemiripan Teks Berita (Application of Simhash and Hamming Distance in Detecting News Text Similarity)
article
Mayesti Anggelina [1], Lucia Dwi Krisnawati [1], Danny Sebastian [1]
[1] Informatika, Universitas Kristen Duta Wacana
Keywords: text reuse; text similarity detection; Hamming distance; Simhash
DOI: 10.21460/jutei.2022.62.216
Source: Universitas Kristen Duta Wacana
[Abstract]

Text reuse is defined as the reuse of existing written sources to create a new text. The degree of reuse ranges from duplicate and near-duplicate to topically similar text. Although some forms of text reuse are acceptable, reused text makes searching inefficient and wastes storage. To overcome this problem, a textual similarity detection system is needed. This study focuses on detecting text similarity by applying the Simhash algorithm, which is used to create document fingerprints. These fingerprints serve as document features through which the degree of text similarity can be compared. The similarity of a suspicious text to the source documents is then measured by Hamming distance. Focusing on duplicate and near-duplicate detection, the experiments show that the recall of duplicate detection reaches 80%, meaning that the system is capable of retrieving most of the duplicate sources of a suspicious document.
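
The fingerprint-and-compare idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word-level tokenization, the use of MD5 as the per-token hash, the equal token weights, the 64-bit fingerprint width, and the threshold of 3 bits in the hypothetical helper is_near_duplicate are all assumptions chosen for illustration.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a `bits`-wide Simhash fingerprint of `text` (bits <= 128)."""
    weights = [0] * bits
    # Assumption: features are lowercased word tokens, each with weight 1.
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 for every bit set in its hash, -1 otherwise.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 where the positive votes outnumber the negative.
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(suspicious: str, source: str, threshold: int = 3) -> bool:
    """Hypothetical helper: flag a pair whose fingerprints differ by at most
    `threshold` bits. The threshold value is an assumption, not the paper's."""
    return hamming_distance(simhash(suspicious), simhash(source)) <= threshold
```

Identical texts always produce identical fingerprints and therefore a Hamming distance of 0. Texts that share most of their tokens tend to differ in only a few bits, which is why a small distance threshold can serve as a duplicate or near-duplicate test.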

[License]

Unknown   
