Technical Report Details
An Extensive Empirical Study of Feature Selection Metrics for Text
Forman, George
HP Development Company
Keywords: supervised machine learning; document categorization; support vector machines; information gain; binormal
RP-ID: HPL-2002-147R1
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
【 Abstract 】

Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing, and personalization. In text domains, effective feature selection is essential to make the learning task efficient and more accurate. This paper presents an empirical comparison of twelve feature selection methods (e.g., Information Gain) evaluated on a benchmark of 229 text classification problem instances gathered from Reuters, TREC, OHSUMED, etc. The results are analyzed from multiple goal perspectives (accuracy, F-measure, precision, and recall), since each is appropriate in different situations. The results reveal that a new feature selection metric, "Bi-Normal Separation" (BNS), outperformed the others by a substantial margin in most situations. This margin widened in tasks with high class skew, which is rampant in text classification problems and is particularly challenging for induction algorithms. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner faced with a single dataset who seeks to choose the one metric (or pair of metrics) most likely to yield the best performance. From this perspective, BNS was the top single choice for all goals except precision, for which Information Gain yielded the best result most often. This analysis also revealed, for example, that Information Gain and Chi-Squared have correlated failures, and so they work poorly together. When choosing optimal pairs of metrics for each of the four performance goals, BNS is consistently a member of the pair; e.g., for greatest recall, the pair BNS + F1-measure yielded the best performance on the greatest number of tasks by a considerable margin.

Notes: To be published in the Journal of Machine Learning Research, Special Issue on Variable and Feature Selection. To view the dataset for this report, select the following link: http://www.hpl.hp.com/techreports/2002/HPL-2002-147R1-dataset.gz

19 Pages
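
The abstract names Bi-Normal Separation (BNS) without defining it. As a rough illustration only, the sketch below uses the commonly cited form of the metric: the absolute difference between the inverse standard normal CDF of a feature's true positive rate and that of its false positive rate. The function name `bns_score`, its parameter names, and the clipping constant `eps` are assumptions made for this example and are not taken from this record.

```python
from statistics import NormalDist


def bns_score(tp, fp, pos, neg, eps=0.0005):
    """Illustrative Bi-Normal Separation score for one feature.

    tp  -- number of positive documents containing the feature
    fp  -- number of negative documents containing the feature
    pos -- total number of positive documents
    neg -- total number of negative documents
    eps -- assumed clipping bound that keeps rates away from 0 and 1,
           where the inverse normal CDF is undefined
    """
    if pos <= 0 or neg <= 0:
        raise ValueError("pos and neg must both be positive")

    # Rates of feature occurrence in each class, clipped to [eps, 1 - eps].
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)

    # Inverse CDF of the standard normal distribution.
    inv = NormalDist().inv_cdf
    return abs(inv(tpr) - inv(fpr))


if __name__ == "__main__":
    # Hypothetical counts: a feature seen in 90 of 100 positive documents
    # and 10 of 1000 negative documents, versus a weakly separating one.
    print(bns_score(90, 10, 100, 1000))
    print(bns_score(60, 40, 100, 100))
```

Under this form, a feature that occurs at the same rate in both classes scores zero, and the score grows as the two rates move apart, including when one class is much rarer than the other.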
