Technical Report Details
Machine Learning for Big Data: A Study to Understand Limits at Scale
Sukumar, Sreenivas R. [1]; Del-Castillo-Negrete, Carlos Emilio [1]
[1] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Keywords: machine learning; Big Data; Scalability; scale; limits of machine learning
DOI  :  10.2172/1234336
RP-ID  :  ORNL/TM--2015/344
PID  :  OSTI ID: 1234336
United States | English
Source: SciTech Connect
【 Abstract 】

This report aims to empirically understand the limits of machine learning when applied to Big Data. We observe that recent innovations in collecting, accessing, organizing, integrating, and querying massive amounts of data from a wide variety of sources have brought statistical data mining and machine learning under more scrutiny, evaluation, and application for gleaning insights from data than ever before. Much is expected from these algorithms without an understanding of their limitations at scale when dealing with massive datasets. In that context, we pose and address the following questions: How does a machine learning algorithm perform on measures such as accuracy and execution time with increasing sample size and feature dimensionality? Does training with more samples guarantee better accuracy? How many features should be computed for a given problem? Do more features guarantee better accuracy? Are the efforts to derive and compute more features and to train on larger samples worth it? As problems become more complex and traditional binary classification algorithms are replaced with multi-task, multi-class categorization algorithms, do parallel learners perform better? What happens to the accuracy of the learning algorithm when it is trained to categorize multiple classes within the same feature space? Towards finding answers to these questions, we describe the design of an empirical study and present the results.
We conclude with the following observations: (i) the accuracy of the learning algorithm increases with increasing sample size but saturates at a point, beyond which more samples do not contribute to better accuracy/learning; (ii) the richness of the feature space dictates performance, both accuracy and training time; (iii) increased dimensionality often resulted in better performance (higher accuracy despite longer training times), but the improvements are not commensurate with the effort spent on feature computation and training; (iv) the accuracy of the learning algorithms drops significantly with multi-class learners trained on the same feature matrix; and (v) learning algorithms perform well when the categories in the labeled data are independent (i.e., no relationship or hierarchy exists among categories).
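Observation (i), the saturation of accuracy with growing sample size, can be reproduced in miniature. The sketch below is not the report's code; it is a hypothetical illustration using a nearest-centroid classifier on synthetic two-class Gaussian data, where accuracy rises with training-set size and then plateaus near the Bayes-optimal rate.

```python
import numpy as np

# Hypothetical sketch (not from the report): learning-curve saturation
# for a nearest-centroid classifier on synthetic Gaussian blobs.
rng = np.random.default_rng(0)

def make_data(n, dim=10, sep=1.0):
    # Two Gaussian classes, shifted by `sep` along every dimension.
    X0 = rng.normal(0.0, 1.0, size=(n // 2, dim))
    X1 = rng.normal(sep, 1.0, size=(n - n // 2, dim))
    X = np.vstack([X0, X1])
    y = np.array([0] * (n // 2) + [1] * (n - n // 2))
    return X, y

def nearest_centroid_accuracy(n_train, n_test=2000):
    # Train: estimate one centroid per class from n_train samples.
    Xtr, ytr = make_data(n_train)
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    # Test: assign each point to the nearer centroid.
    Xte, yte = make_data(n_test)
    d0 = np.linalg.norm(Xte - c0, axis=1)
    d1 = np.linalg.norm(Xte - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return (pred == yte).mean()

# Accuracy as a function of training-set size: it climbs, then flattens.
accs = {n: nearest_centroid_accuracy(n) for n in (10, 100, 1000, 10000)}
for n, a in accs.items():
    print(f"n_train={n:>5}  accuracy={a:.3f}")
```

The gap between 1,000 and 10,000 training samples is far smaller than the gap between 10 and 100, which is the saturation behavior the report describes: past a point, additional samples buy little additional accuracy.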
