Technical Report

【Abstract】

Consider the following common scenario: a data mining practitioner tries various specialized classification algorithms on a new dataset of unknown difficulty and selects the apparent best. Supposing its accuracy were 70% on a held-out test set, how can one know whether this is a significant result or not? It can be difficult to tell in the absence of standard benchmark results for the dataset. Surprisingly, it can also be difficult to tell even when the dataset has hundreds of benchmark results. This paper presents a method to address this question by comparing the chosen best classifier to the distribution of performance scores obtained by many simple classifiers that are randomly generated. This can also serve to discover when a classification problem appears nearly unlearnable. It is demonstrated for the results of the 2001 KDD Cup thrombin competition.

Notes: To be published in and presented at the Data Mining Lessons Learned Workshop, the 19th International Conference on Machine Learning (ICML), 8-12 July 2002, Sydney, Australia. 5 pages.
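
The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the simple random classifiers are built, so everything here is an assumption made for illustration: the classifiers are single-feature rules over binary features, and the helper names `random_stump`, `accuracy`, and `fraction_at_least` are hypothetical. The sketch scores many such random rules on the held-out test set and reports what fraction of them do at least as well as the chosen best classifier.

```python
import random


def random_stump(n_features, rng):
    """Hypothetical 'simple random classifier' (an assumption, not the
    report's construction): predict 1 when one randomly chosen binary
    feature equals a randomly chosen value, otherwise predict 0."""
    feature = rng.randrange(n_features)
    value = rng.randrange(2)
    return lambda x: 1 if x[feature] == value else 0


def accuracy(classifier, X, y):
    """Fraction of test examples the classifier labels correctly."""
    correct = sum(1 for xi, yi in zip(X, y) if classifier(xi) == yi)
    return correct / len(y)


def fraction_at_least(best_score, X_test, y_test, n_random=1000, seed=0):
    """Score n_random random stumps on the test set and return
    (fraction scoring >= best_score, list of all random scores)."""
    rng = random.Random(seed)
    n_features = len(X_test[0])
    scores = [
        accuracy(random_stump(n_features, rng), X_test, y_test)
        for _ in range(n_random)
    ]
    hits = sum(1 for s in scores if s >= best_score)
    return hits / n_random, scores


if __name__ == "__main__":
    # Toy test set with two binary features; the label equals feature 0.
    X_test = [[0, 1], [1, 0], [1, 1], [0, 0]]
    y_test = [0, 1, 1, 0]
    frac, _ = fraction_at_least(0.70, X_test, y_test)
    print(f"fraction of random stumps scoring >= 0.70: {frac:.3f}")
```

If a large fraction of the random classifiers match or beat the best classifier's score, that score is weak evidence of real learning. If almost none do, the result stands out from what chance-level search would produce.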

A Method for Discovering the Insignificance of One's Best Classifier

Forman, George
HP Development Company
Keywords: supervised machine learning; overfitting; 2001 KDD Cup thrombin classification competition
RP-ID : HPL-2002-123R2
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
