Technical Report Details
BNS Scaling: An Improved Representation over TF·IDF for SVM Text
Forman, George
HP Development Company
Keywords: text classification; topic identification; machine learning; feature selection; Support Vector Machine; TF*IDF text representation
RP-ID  :  HPL-2007-32R1
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
【 Abstract 】

In the realm of machine learning for text classification, TF·IDF is the most widely used representation for real-valued feature vectors. Unfortunately, it is oblivious to the training class labels, and naturally scales some features inappropriately. We replace IDF with Bi-Normal Separation (BNS), which was previously found to be excellent at ranking words for feature selection filtering. Empirical evaluation on a benchmark of 237 binary text classification tasks shows substantially better accuracy and F-measure for a Support Vector Machine (SVM) by using the BNS scaling representation. A wide variety of other feature scaling methods were found inferior, including binary features. Furthermore, BNS scaling yielded better performance without feature selection, obviating the complexities of feature selection.
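
As an illustration of the idea in the abstract, the following is a minimal sketch of replacing the IDF factor with a BNS weight. It assumes the usual definition of Bi-Normal Separation, |F⁻¹(tpr) − F⁻¹(fpr)|, where F⁻¹ is the inverse standard normal CDF and tpr and fpr are the fractions of positive and negative training documents that contain the term. The function names, the dictionary-based document format, and the clipping constant `eps` are assumptions made for this example and are not taken from this record.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def bns_weight(tp, fp, pos, neg, eps=0.0005):
    """Return |F^-1(tpr) - F^-1(fpr)| for a single term.

    tp / fp: number of positive / negative training documents containing the term.
    pos / neg: total number of positive / negative training documents.
    eps: assumed clipping bound that keeps the rates away from 0 and 1,
         where the inverse normal CDF is undefined.
    """
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(_STD_NORMAL.inv_cdf(tpr) - _STD_NORMAL.inv_cdf(fpr))


def bns_scale(docs, labels):
    """Scale term frequencies by BNS instead of IDF.

    docs: list of {term: term_frequency} dictionaries (hypothetical input format).
    labels: list of booleans, True for the positive class.
    Returns the scaled documents and the per-term BNS weights.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    tp, fp = {}, {}
    for doc, is_positive in zip(docs, labels):
        counts = tp if is_positive else fp
        for term in doc:  # document frequency: each document counts once per term
            counts[term] = counts.get(term, 0) + 1
    vocab = set(tp) | set(fp)
    weights = {t: bns_weight(tp.get(t, 0), fp.get(t, 0), pos, neg) for t in vocab}
    scaled = [{t: tf * weights[t] for t, tf in doc.items()} for doc in docs]
    return scaled, weights


docs = [
    {"goal": 2, "the": 1},
    {"goal": 1, "the": 3},
    {"vote": 1, "the": 2},
    {"vote": 2, "the": 1},
]
labels = [True, True, False, False]
scaled, weights = bns_scale(docs, labels)
```

In this toy data the word "the" appears in every document of both classes, so its BNS weight is zero and it drops out of the scaled vectors, while "goal" and "vote" each appear in only one class and receive the largest weight the clipping allows. Unlike IDF, the weights depend on the class labels, so they would have to be computed on the training split only.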
