Technical Report Details
Choose Your Words Carefully: An Empirical Study of Feature Selection
Forman, George
HP Development Company
Keywords: supervised machine learning; document categorization; support vector machines; binormal separation; residual failure analysis
RP-ID: HPL-2002-88R2
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
【 Abstract 】

Good feature selection is essential for text classification: it makes the problem tractable for machine learning and improves classification performance. This study benchmarks twelve feature selection metrics across 229 text classification problems drawn from Reuters, OHSUMED, TREC, etc., using Support Vector Machines. The results are analyzed for various objectives. For best accuracy, F-measure, or recall, the findings reveal an outstanding new feature selection metric, "Bi-Normal Separation" (BNS). For precision alone, however, Information Gain (IG) was superior. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner who seeks to choose the one or two metrics most likely to perform best on the single dataset at hand. This analysis determined, for example, that IG and Chi-Squared have correlated failures for precision, and that IG paired with BNS is a better choice.

Notes: Copyright Springer-Verlag. Published in and presented at the 13th European Conference on Machine Learning (ECML '02)/6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 19-23 August 2002, Helsinki, Finland. 12 pages.
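
The abstract names Bi-Normal Separation without defining it. As a minimal sketch, assuming the commonly cited definition (the absolute difference between the inverse standard normal CDF of a feature's true positive rate and that of its false positive rate), a per-feature score could be computed as below. The function name `bns`, its argument names, and the clipping constant `eps=0.0005` are illustrative assumptions and are not stated on this page.

```python
from statistics import NormalDist


def bns(tp, fn, fp, tn, eps=0.0005):
    """Sketch of a Bi-Normal Separation score for one feature.

    tp / fn: positive documents that do / do not contain the feature.
    fp / tn: negative documents that do / do not contain the feature.
    eps is an assumed clipping bound that keeps the inverse CDF finite
    when a rate is exactly 0 or 1.
    """
    pos = tp + fn
    neg = fp + tn
    if pos == 0 or neg == 0:
        raise ValueError("both classes need at least one document")

    # Clip the rates into [eps, 1 - eps] before inverting the normal CDF.
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)

    inv_cdf = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return abs(inv_cdf(tpr) - inv_cdf(fpr))


if __name__ == "__main__":
    # A word that appears equally often in both classes scores near zero;
    # a word concentrated in one class scores higher.
    print(bns(50, 50, 50, 50))
    print(bns(80, 20, 20, 80))
```

Under this sketch, features would be ranked by score and the top-scoring ones kept before training the classifier; the cutoff is left to the practitioner.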
