The 2nd International Workshop on Adversarial Information Retrieval on the Web
Tracking Web Spam with Hidden Style Similarity
Computer Science; Library, Information and Archival Science
Tanguy Urvoy; Thomas Lavergne; Pascal Filoche
Others: http://airweb.cse.lehigh.edu/2006/urvoy.pdf
PID: 7236
Source: CEUR
[Abstract]
Automatically generated content is ubiquitous on the web: dynamic sites built using the three-tier paradigm are good examples (e.g. commercial sites, blogs and other sites powered by web authoring software), as are less legitimate spamdexing attempts (e.g. link farms, fake directories, ...). Pages built using the same generating method (template or script) share a common "look and feel" that is not easily detected by common text classification methods, but is more closely related to stylometry. In this paper, we present a (hidden) style similarity measure based on extra-textual features in HTML source code. We also describe a method to cluster a large collection of documents according to this measure. Since the clustering algorithm is based on fingerprints, we also briefly review fingerprinting. By conveniently sorting the generated clusters, one can efficiently track down instances of a particular automatic content generation method among web pages collected using a crawler. This is particularly useful for detecting pages across different sites that share the same design; this is often a good hint of either a spamdexing attempt or mirrored content.
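
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not say which extra-textual features are used. The sketch assumes that the "hidden style" of a page is whatever remains of the HTML source once alphanumeric characters are dropped, compares pages by the Jaccard overlap of character n-grams, and uses a min-hash fingerprint as one common fingerprinting technique. All function names and parameters (`style_noise`, `fingerprint`, `cluster`, `n`, `k`, `threshold`) are hypothetical, and the pairwise comparison in `cluster` is a quadratic simplification that would not scale to a large crawl.

```python
import hashlib
import re
from collections import defaultdict


def style_noise(html):
    """Keep only extra-textual characters (assumption: drop all alphanumerics)."""
    return re.sub(r"[A-Za-z0-9]+", "", html)


def shingles(text, n=8):
    """Return the set of overlapping character n-grams of `text`."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def style_similarity(html_a, html_b, n=8):
    """Jaccard similarity between the style n-grams of two pages."""
    a = shingles(style_noise(html_a), n)
    b = shingles(style_noise(html_b), n)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, gram):
    """64-bit hash of an n-gram, salted by `seed` to get independent functions."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def fingerprint(html, k=64, n=8):
    """Min-hash fingerprint: for each of k hash functions, keep the minimum."""
    grams = shingles(style_noise(html), n)
    if not grams:
        return (0,) * k
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(k))


def cluster(pages, threshold=0.9, k=64, n=8):
    """Group page indices whose fingerprints agree on >= threshold of positions."""
    prints = [fingerprint(p, k, n) for p in pages]
    parent = list(range(len(pages)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            agree = sum(x == y for x, y in zip(prints[i], prints[j])) / k
            if agree >= threshold:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i in range(len(pages)):
        groups[find(i)].append(i)
    return sorted(groups.values())
```

Under these assumptions, two pages produced from the same template with different wording reduce to nearly the same character stream, so they score high on `style_similarity` and fall into the same cluster, while a page with different markup does not. Sorting the resulting clusters, for example by size or by the number of distinct hosts they span, is one way to surface templates that are reused across sites.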