Thesis details
Spam Filter Improvement Through Measurement
Lynam, Thomas Richard
University of Waterloo
Keywords: evaluation methodology; spam filtering; spam corpora; spam fusion; Computer Science
Others  :  https://uwspace.uwaterloo.ca/bitstream/10012/4344/1/thesis.pdf
Switzerland | English
Source: UWSPACE Waterloo Institutional Repository
【 Abstract 】

This work supports the thesis that sound quantitative evaluation for spam filters leads to substantial improvement in the classification of email. To this end, new laboratory testing methods and datasets are introduced, and evidence is presented that their adoption at Text REtrieval Conference (TREC) and elsewhere has led to an improvement in state of the art spam filtering. While many of these improvements have been discovered by others, the best-performing method known at this time -- spam filter fusion -- was demonstrated by the author.

This work describes four principal dimensions of spam filter evaluation methodology and spam filter improvement. An initial study investigates the application of twelve open-source filter configurations in a laboratory environment, using a stream of 50,000 messages captured from a single recipient over eight months. The study measures the impact of user feedback and on-line learning on filter performance using methodology and measures which were released to the research community as the TREC Spam Filter Evaluation Toolkit.

The toolkit was used as the basis of the TREC Spam Track, which the author co-founded with Cormack. The Spam Track, in addition to evaluating a new application (email spam), addressed the issue of testing systems on both private and public data. While streams of private messages are most realistic, they are not easy to come by and cannot be shared with the research community as archival benchmarks. Using the toolkit, participant filters were evaluated on both, and the differences found not to substantially confound evaluation; as a result, public corpora were validated as research tools. Over the course of TREC and similar evaluation efforts, a dozen or more archival benchmarks -- some private and some public -- have become available.

The toolkit and methodology have spawned improvements in the state of the art every year since its deployment in 2005. In 2005, 2006, and 2007, the spam track yielded new best-performing systems based on sequential compression models, orthogonal sparse bigram features, logistic regression and support vector machines. Using the TREC participant filters, we develop and demonstrate methods for on-line filter fusion that outperform all other reported on-line personal spam filters.
