Technical Report

【Abstract】

Good feature selection is essential for text classification, both to make the problem tractable for machine learning and to improve classification performance. This study benchmarks twelve feature selection metrics on 229 text classification problems drawn from Reuters, OHSUMED, TREC, etc., using Support Vector Machines. The results are analyzed for various objectives. For best accuracy, F-measure, or recall, the findings reveal an outstanding new feature selection metric, "Bi-Normal Separation" (BNS). For precision alone, however, Information Gain (IG) was superior. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner who seeks to choose the one or two metrics that are most likely to perform best on the single dataset at hand. This analysis determined, for example, that IG and Chi-Squared have correlated failures for precision, and that IG paired with BNS is a better choice.

Notes: Copyright Springer-Verlag. Published in and presented at the 13th European Conference on Machine Learning (ECML '02)/6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 19-23 August 2002, Helsinki, Finland. 12 pages.
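The abstract names Bi-Normal Separation and Information Gain without defining them. As a rough illustration only, the sketch below scores a single word from its document counts, assuming the commonly cited definitions: BNS as the absolute difference of the inverse standard normal CDF applied to the true-positive and false-positive rates, and IG as the reduction in class entropy from knowing whether the word is present. The function names, the argument layout (tp, fp, pos, neg), and the clipping value eps are assumptions for this sketch and are not taken from this record.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def bns(tp, fp, pos, neg, eps=0.0005):
    """Bi-Normal Separation sketch: |F^-1(tpr) - F^-1(fpr)|.

    tp/fp: positive/negative documents containing the word.
    pos/neg: total positive/negative documents.
    eps is an assumed clipping bound that keeps the inverse CDF finite.
    """
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(_STD_NORMAL.inv_cdf(tpr) - _STD_NORMAL.inv_cdf(fpr))


def _entropy(a, b):
    """Binary entropy in bits of a two-way count split (0 if empty)."""
    total = a + b
    if total == 0:
        return 0.0
    h = 0.0
    for count in (a, b):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(tp, fp, pos, neg):
    """Class entropy minus the expected entropy after splitting on the word."""
    fn, tn = pos - tp, neg - fp
    p_word = (tp + fp) / (pos + neg)
    split = p_word * _entropy(tp, fp) + (1 - p_word) * _entropy(fn, tn)
    return _entropy(pos, neg) - split
```

Under these definitions, a word that occurs at the same rate in both classes scores zero on both metrics, while a word concentrated in the positive class scores high on both.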



Choose Your Words Carefully: An Empirical Study of Feature Selection

Forman, George
HP Development Company
Keywords: supervised machine learning; document categorization; support vector machines; binormal separation; residual failure analysis
RP-ID : HPL-2002-88R2
Subject classification: Computer Science (General)
United States | English
Source: HP Labs


