Technical Report

【Abstract】

In many business and science applications, it is important to track trends over historical data, for example, measuring the monthly prevalence of influenza incidents at a hospital. In situations where a machine learning classifier is needed to identify the relevant incidents from among all cases in the database, anything less than perfect classification accuracy will result in a consistent and potentially substantial bias in estimating the class prevalence. There is an assumption ubiquitous in machine learning that the class distribution of the training set matches that of the test set, but this is certainly not the case for applications where the goal is to measure changes or trends in the distribution over time. The paper defines two research challenges for machine learning that address this distribution mismatch problem. The 'quantification' task is to accurately estimate the number of positive cases (or class distribution) in an unlabeled test set via machine learning, using a limited training set that may have a substantially different class distribution. The 'cost quantification' task is to estimate the total cost associated with the positive class, where each case is tagged with a cost attribute, such as the hours of labor needed to resolve the case. Obtaining a precise quantification estimate over a set of cases has a very different utility model from traditional classification research, whose goal is to obtain an accurate classification for each individual case. For both forms of quantification, the paper describes a suitable experiment methodology and evaluates a variety of methods. It reveals which methods give more reliable estimates, even when training data is scarce and the testing class distribution differs widely from training. Some methods function well even under high class imbalance, e.g. 1% positives. These strengths can make quantification practical for business use, even where classification accuracy is poor. 
Publication Info: To be published in the international journal Data Mining and Knowledge Discovery, in a special issue on Utility-Based Data Mining. 25 pages.
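
Illustration (not from the report): the abstract's claim that imperfect classification biases a prevalence estimate can be checked with a few lines of arithmetic. The sketch below assumes a classifier whose true positive rate and false positive rate stay fixed while the class distribution changes; the values 0.80 and 0.05 and all function names are hypothetical, chosen only for illustration. Under that assumption, the expected fraction of cases labeled positive is p * tpr + (1 - p) * fpr, which generally differs from the true prevalence p.

```python
# Hypothetical classifier characteristics, assumed for illustration only.
TPR = 0.80  # fraction of true positives the classifier labels positive
FPR = 0.05  # fraction of true negatives the classifier labels positive


def expected_counted_prevalence(true_prevalence, tpr=TPR, fpr=FPR):
    """Expected fraction of cases labeled positive by simple counting."""
    return true_prevalence * tpr + (1.0 - true_prevalence) * fpr


def prevalence_bias_table(prevalences, tpr=TPR, fpr=FPR):
    """Return (true prevalence, counted prevalence, bias) for each input."""
    rows = []
    for p in prevalences:
        counted = expected_counted_prevalence(p, tpr, fpr)
        rows.append((p, counted, counted - p))
    return rows


if __name__ == "__main__":
    for p, counted, bias in prevalence_bias_table([0.01, 0.10, 0.20, 0.50]):
        print(f"true={p:.2f}  counted={counted:.4f}  bias={bias:+.4f}")
```

With these assumed rates, counting overestimates a rare class (a true prevalence of 1% is counted as about 5.75%) and underestimates a common one (50% is counted as 42.5%), so a trend measured by counting alone would be compressed.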

【Preview】

Attachments
Files	Size	Format	View
RO201804100001738LZ	397KB	PDF	download


Quantifying Counts, Costs, and Trends Accurately via Machine Learning

Forman, George
HP Development Company
Keywords: supervised machine learning; classification; prevalence estimation; class distribution estimation; cost quantification; quantification research methodology; minimizing training effort; detecting and tracking trends; concept drift; class imbalance; text mining
RP-ID : HPL-2007-164R1
Subject classification: Computer Science (General)
United States | English
Source: HP Labs


	Document metrics
	Downloads: 29	Views: 49
