Technical Report Details
Counting Positives Accurately Despite Inaccurate Classification
Forman, George
HP Development Company
Keywords: supervised machine learning; estimation; mixture models; shifting class prior; non-stationary class distribution
RP-ID: HPL-2005-96R1
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
[Abstract]

Most supervised machine learning research assumes the training set is a random sample from the target population, and thus that the class distribution is invariant. In real-world situations, however, the class distribution changes, and this shift is known to erode the effectiveness of classifiers and calibrated probability estimators. This paper focuses on the problem of accurately estimating the number of positives in the test set (quantification), as opposed to classifying individual cases accurately. It compares three methods: classify & count, an adjusted variant, and a mixture model. An empirical evaluation on a text classification benchmark reveals that the simple method is consistently biased, and that the mixture model is surprisingly effective even when positives are very scarce in the training set, a common case in information retrieval.

Notes: Copyright 2005 Springer-Verlag. Published in and presented at the 16th European Conference on Machine Learning (ECML'05), 3-7 October 2005, Porto, Portugal. http://ecmlpkdd05.liacc.up.pt/ 12 pages.
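The abstract names the methods but gives no formulas. As a rough illustration only, the Python sketch below contrasts classify & count with an adjusted variant. It assumes the adjustment rescales the observed positive rate using the classifier's true positive rate and false positive rate (for example, as estimated by cross-validation on the training set). The function names, the clipping to [0, 1], and the numbers in the example are hypothetical and are not taken from the report. The mixture model is not sketched here.

```python
def classify_and_count(predictions):
    """Return the fraction of cases the classifier labeled positive (1)."""
    return sum(predictions) / len(predictions)


def adjusted_count(predictions, tpr, fpr):
    """Correct the observed positive rate for classifier error.

    Assumption: observed = tpr * p + fpr * (1 - p), solved for the
    true positive proportion p. tpr and fpr are assumed to have been
    estimated elsewhere, e.g. by cross-validation on the training set.
    """
    if tpr <= fpr:
        raise ValueError("tpr must exceed fpr for the adjustment to be defined")
    observed = classify_and_count(predictions)
    estimate = (observed - fpr) / (tpr - fpr)
    # Clip to the valid range, since estimated rates can push p outside [0, 1].
    return min(1.0, max(0.0, estimate))


# Hypothetical test set: 1000 cases, 200 of them truly positive.
# A classifier with tpr = 0.5 and fpr = 0.05 flags 0.5*200 + 0.05*800 = 140 cases.
predictions = [1] * 140 + [0] * 860

raw = classify_and_count(predictions)             # underestimates the 20% prevalence
adjusted = adjusted_count(predictions, 0.5, 0.05)  # recovers roughly 0.2
print(f"classify & count: {raw:.3f}, adjusted: {adjusted:.3f}")
```

In this made-up example the raw count reports 14% positives although the true proportion is 20%, which illustrates the kind of bias the abstract attributes to the simple method; the adjusted estimate is only as good as the tpr and fpr estimates it is given.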
