Technical Report Details
A Pitfall and Solution in Multi-Class Feature Selection for Text
Forman, George
HP Development Company
Keywords: benchmark comparison; text classification; information retrieval; F-measure; precision in the top 10; small training sets; skewed/unbalanced class distribution
RP-ID: HPL-2004-86
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
Abstract

Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. On investigating the root cause, we find that a large class of feature scoring methods suffers a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring features needed to discriminate difficult classes. In this paper we demonstrate this pitfall hurts performance even for a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid this pitfall, without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.

Notes: Published in and presented at the 21st International Conference on Machine Learning, 4-8 July 2004, Banff, Alberta, Canada. 8 pages.
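
The round-robin idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring function, tie-breaking, or stopping rule, so everything below is an assumption made for illustration: documents are modeled as sets of words, each class ranks features by one-vs-rest Information Gain, and the classes then take turns contributing their best feature that has not yet been chosen. The names `entropy`, `info_gain`, and `round_robin_select` are hypothetical and are not taken from the report.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a split with `pos` and `neg` counts."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def info_gain(docs, labels, feature, target):
    """Information Gain of a binary word feature for `target` vs. the rest."""
    n = len(docs)
    pos = sum(1 for y in labels if y == target)
    with_pos = with_neg = 0
    for words, y in zip(docs, labels):
        if feature in words:
            if y == target:
                with_pos += 1
            else:
                with_neg += 1
    n_with = with_pos + with_neg
    n_without = n - n_with
    without_pos = pos - with_pos
    without_neg = (n - pos) - with_neg
    conditional = (n_with / n) * entropy(with_pos, with_neg) \
        + (n_without / n) * entropy(without_pos, without_neg)
    return entropy(pos, n - pos) - conditional


def round_robin_select(docs, labels, k):
    """Pick up to k features, letting each class take turns (assumed scheme)."""
    vocab = sorted(set().union(*docs))
    classes = sorted(set(labels))
    # One ranking per class, best feature first; ties fall back to
    # alphabetical order because Python's sort is stable.
    rankings = {
        c: sorted(vocab, key=lambda f: -info_gain(docs, labels, f, c))
        for c in classes
    }
    k = min(k, len(vocab))
    selected, chosen = [], set()
    cursor = {c: 0 for c in classes}
    while len(selected) < k:
        for c in classes:
            # Skip features that another class has already contributed.
            while cursor[c] < len(vocab) and rankings[c][cursor[c]] in chosen:
                cursor[c] += 1
            if cursor[c] < len(vocab):
                feature = rankings[c][cursor[c]]
                selected.append(feature)
                chosen.add(feature)
            if len(selected) == k:
                break
    return selected
```

Under these assumptions, every class contributes its strongest remaining feature on each pass, so a class with only a few weak predictors is not crowded out by a class that has a surplus of strong ones, which is the pitfall the abstract describes for a single global ranking.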
