Technical Report Details
Machine Learning for Big Data: A Study to Understand Limits at Scale
Sukumar, Sreenivas R. [1]; Del-Castillo-Negrete, Carlos Emilio [1]
[1] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Keywords: machine learning; Big Data; Scalability; scale; limits of machine learning
DOI  :  10.2172/1234336
RP-ID  :  ORNL/TM--2015/344
PID  :  OSTI ID: 1234336
United States | English
Source: SciTech Connect
【 Abstract 】

This report aims to empirically understand the limits of machine learning when applied to Big Data. We observe that recent innovations in collecting, accessing, organizing, integrating, and querying massive amounts of data from a wide variety of sources have brought statistical data mining and machine learning under more scrutiny, evaluation, and application for gleaning insights from data than ever before. Much is expected from these algorithms without an understanding of their limitations at scale when dealing with massive datasets. In that context, we pose and address the following questions: How does a machine learning algorithm perform on measures such as accuracy and execution time with increasing sample size and feature dimensionality? Does training with more samples guarantee better accuracy? How many features should be computed for a given problem? Do more features guarantee better accuracy? Are the efforts to derive and compute more features and to train on larger samples worth it? As problems become more complex and traditional binary classification algorithms are replaced with multi-task, multi-class categorization algorithms, do parallel learners perform better? What happens to the accuracy of the learning algorithm when it is trained to categorize multiple classes within the same feature space? Towards finding answers to these questions, we describe the design of an empirical study and present the results.
We conclude with the following observations: (i) the accuracy of the learning algorithm increases with increasing sample size but saturates at a point, beyond which more samples do not contribute to better accuracy/learning; (ii) the richness of the feature space dictates performance, both accuracy and training time; (iii) increased dimensionality often resulted in better performance (higher accuracy despite longer training times), but the improvements are not commensurate with the effort spent on feature computation and training; (iv) the accuracy of the learning algorithms drops significantly with multi-class learners trained on the same feature matrix; and (v) learning algorithms perform well when the categories in the labeled data are independent (i.e., no relationship or hierarchy exists among categories).
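Observation (i), the saturation of accuracy with growing sample size, can be reproduced in miniature. The sketch below is not the report's code; it is a hypothetical illustration using a nearest-centroid classifier on synthetic two-class Gaussian data, where accuracy rises with training-set size and then plateaus near the Bayes-optimal rate.

```python
import numpy as np

# Hypothetical sketch (not from the report): learning-curve saturation
# for a nearest-centroid classifier on synthetic Gaussian blobs.
rng = np.random.default_rng(0)

def make_data(n, dim=10, sep=1.0):
    # Two Gaussian classes, shifted by `sep` along every dimension.
    X0 = rng.normal(0.0, 1.0, size=(n // 2, dim))
    X1 = rng.normal(sep, 1.0, size=(n - n // 2, dim))
    X = np.vstack([X0, X1])
    y = np.array([0] * (n // 2) + [1] * (n - n // 2))
    return X, y

def nearest_centroid_accuracy(n_train, n_test=2000):
    # Train: estimate one centroid per class from n_train samples.
    Xtr, ytr = make_data(n_train)
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    # Test: assign each point to the nearer centroid.
    Xte, yte = make_data(n_test)
    d0 = np.linalg.norm(Xte - c0, axis=1)
    d1 = np.linalg.norm(Xte - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return (pred == yte).mean()

# Accuracy as a function of training-set size: it climbs, then flattens.
accs = {n: nearest_centroid_accuracy(n) for n in (10, 100, 1000, 10000)}
for n, a in accs.items():
    print(f"n_train={n:>5}  accuracy={a:.3f}")
```

The gap between 1,000 and 10,000 training samples is far smaller than the gap between 10 and 100, which is the saturation behavior the report describes: past a point, additional samples buy little additional accuracy.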
