Dissertation details
New capabilities for large-scale exploratory data analysis
Xu, Liqi
Keywords: Exploratory Data Analysis; Data Management
Others: https://www.ideals.illinois.edu/bitstream/handle/2142/107971/XU-DISSERTATION-2020.pdf?sequence=1&isAllowed=y
United States | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
【 Abstract 】

The ever-rising diversity of data generated, manipulated, and analyzed every day engenders a variety of data formats, ranging from one fixed dataset to multiple versions of a dataset stored across multiple data sources. This variety of formats has led to substantial challenges in data exploration. Existing systems do not effectively support querying capabilities across these formats: (i) Browsing: When exploring a single dataset, data scientists often need to examine a collection of records that satisfy arbitrary predicates. However, current exploratory data analysis tools mainly focus on visual summarization rather than browsing. (ii) Versioning: With the proliferation of dataset versions generated during different stages of exploration, exploratory data analysis is no longer just about exploring one static dataset. Instead, data scientists need to keep track of massive numbers of versions, as well as search for versions that meet specific criteria. (iii) Integrating: Nowadays, datasets are collected and stored at multiple sources (e.g., as part of the IoT). When exploring data, data scientists often need to query and join data across databases at disparate locations.

In this dissertation, we propose systems that provide query capabilities to efficiently and effectively fulfill these new demands in data exploration. (i) For browsing, we develop NEEDLETAIL, a data exploration engine that employs a light-weight indexing structure along with efficient algorithms to retrieve any-k valid records for arbitrary queries as quickly as possible. (ii) For versioning, we implement and open-source ORPHEUSDB, a dataset version control system that can efficiently track and query across dataset versions. Since versioning queries in ORPHEUSDB take advantage of array operators in relational database systems, we also conduct an extensive experimental study to understand array implementations in modern database systems. (iii) For integrating, we leverage machine learning techniques to optimize federated query processing and eventually improve the interactivity of data exploration across disparate databases.

【 Preview 】
Attachments
Files | Size | Format | View
New capabilities for large-scale exploratory data analysis | 3995 KB | PDF | download
Document metrics
Downloads: 8 | Views: 31