Harnessing the Power of Multi-Source Data: an Exploration of Diversity and Similarity.
Multi-source data quality control; Crowd-sourcing; Online learning and decision making; Data diversity and similarity; Cyber security measurement; Electrical Engineering; Engineering; Electrical Engineering: Systems
This dissertation studies a sequence of problems concerning the collection and utilization of data from disparate sources, e.g., those arising in a crowd-sourcing system. It aims to develop learning methods that enhance the quality of decision-making and the performance of learning tasks by exploiting the diversity, similarity, and interdependency inherent in a crowd-sourcing system and among disparate data sources. We start with a family of problems on sequential decision-making combined with data collection in a crowd-sourcing system, where the goal is to improve the quality of data input or computational output while reducing the cost of using such a system. In this context, the learning methods we develop are closed-loop and online, i.e., decisions are functions of past data observations, present actions determine future observations, and learning occurs as data inputs arrive. The similarity and disparity among different data sources help us in some cases to speed up the learning process (e.g., in a recommender system), and in other cases to perform quality control over data input for which ground truth may be non-existent or cannot be obtained directly (e.g., in a crowd-sourcing market using Amazon Mechanical Turk (AMT)). We then apply our algorithms to a large set of network malicious activity data collected from diverse sources, with the goal of uncovering interconnectedness/similarity between different network entities' malicious behaviors. Specifically, we apply the online prediction algorithm presented and analyzed in earlier parts of the dissertation to this data and show its effectiveness in predicting next-day maliciousness. Furthermore, we show that data-specific properties of this data set allow us to map networks' behavioral similarity to similarity in their topological features. This in turn enables prediction even in the absence of measurement data.