Technical Report Details
A Method for Discovering the Insignificance of One's Best Classifier
Forman, George
HP Development Company
Keywords: supervised machine learning; overfitting; 2001 KDD Cup thrombin classification competition
RP-ID: HPL-2002-123R2
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
【 Abstract 】

Consider the following common scenario: a data mining practitioner tries various specialized classification algorithms on a new dataset of unknown difficulty and selects the apparent best. Supposing its accuracy were 70% on a held-out test set, how can one know whether this is a significant result or not? It can be difficult to tell in the absence of standard benchmark results for the dataset. Surprisingly, it can also be difficult to tell even when the dataset has hundreds of benchmark results. This paper presents a method to address this question by comparing the chosen best classifier to the distribution of performance scores obtained by many simple classifiers that are randomly generated. This can also serve to discover when a classification problem appears nearly unlearnable. It is demonstrated for the results of the 2001 KDD Cup thrombin competition.

Notes: To be published in and presented at the Data Mining Lessons Learned Workshop, the 19th International Conference on Machine Learning (ICML), 8-12 July 2002, Sydney, Australia. 5 pages.
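
The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not say what the "simple classifiers" are or how they are generated, so everything below is an assumption made for illustration: binary features, a hypothetical `random_stump` that predicts from a single randomly chosen feature, accuracy as the performance score, and synthetic data whose labels are independent of the features. It is not the report's actual procedure.

```python
import random

def random_stump(n_features, rng):
    # Hypothetical "simple classifier" (an assumption, not from the report):
    # predict 1 when one randomly chosen binary feature equals a random polarity.
    j = rng.randrange(n_features)
    polarity = rng.randrange(2)
    return lambda x: 1 if x[j] == polarity else 0

def accuracy(clf, X, y):
    # Fraction of examples on which the classifier's prediction matches the label.
    return sum(clf(x) == t for x, t in zip(X, y)) / len(y)

def random_baseline_scores(X, y, n_classifiers=1000, seed=0):
    # Score many randomly generated stumps on the same held-out set.
    rng = random.Random(seed)
    n_features = len(X[0])
    return [accuracy(random_stump(n_features, rng), X, y)
            for _ in range(n_classifiers)]

def fraction_at_least(scores, best_score):
    # Share of random classifiers that score at least as well as the chosen one.
    return sum(s >= best_score for s in scores) / len(scores)

# Synthetic held-out set (illustrative only): 100 examples, 50 binary features,
# labels drawn independently of the features, so nothing is learnable.
data_rng = random.Random(42)
X = [[data_rng.randrange(2) for _ in range(50)] for _ in range(100)]
y = [data_rng.randrange(2) for _ in range(100)]

scores = random_baseline_scores(X, y)
# 0.70 mirrors the "70% on a held-out test set" scenario from the abstract.
print("fraction of random stumps scoring >= 0.70:", fraction_at_least(scores, 0.70))
```

If a noticeable share of the random classifiers matches or beats the chosen classifier's score, that score gives little evidence of a significant result; if almost none do, the score stands out from this baseline distribution.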
