Technical Report

【Abstract】

Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. On investigating the root cause, we found that a large class of feature scoring methods suffers from a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring the features needed to discriminate difficult classes. In this paper we demonstrate that this pitfall hurts performance even for a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid this pitfall without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.

Notes: Published in and presented at the 21st International Conference on Machine Learning, 4-8 July 2004, Banff, Alberta, Canada. 8 pages.


A Pitfall and Solution in Multi-Class Feature Selection for Text

Forman, George
HP Development Company
Keywords: benchmark comparison; text classification; information retrieval; F-measure; precision in the top 10; small training sets; skewed/unbalanced class distribution
RP-ID: HPL-2004-86
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
