Thesis Details
Analytics for Everyone
Author: El Gebaly, Kareem
Advisors: Lin, Jimmy; Golab, Lukasz; Aboulnaga, Ashraf
Affiliation: Faculty of Mathematics
University of Waterloo
Keywords: Column-oriented; informative; open data; open government; Doctoral Thesis; explanation tables; javascript; analytics; code generation; RDBMS; browser; mnemonic; data exploration; SQL engine; in-browser; interpretable
Others  :  https://uwspace.uwaterloo.ca/bitstream/10012/13350/1/ELGEBALY_KAREEM.pdf
Switzerland | English
Source: UWSPACE Waterloo Institutional Repository
【 Abstract 】

Analyzing relational data typically involves tasks that facilitate gaining familiarity or insights and coming up with findings or conclusions based on the data. This process is usually practiced by data experts, such as data scientists, who share their output with a potentially less expert audience (everyone). Our goal is to enable everyone to participate in analyzing data rather than passively consuming its outputs (analytics democratization). With today’s increasing availability of data (data democratization) on the internet (web), combined with already widespread personal computing capabilities, such a goal is becoming more attainable. With the recent increase of public data, i.e., Open Data, users without a technical background are keener than ever to analyze new data sets that are relevant to wide sectors of society. An important example of Open Data is the data released by governments all over the world, i.e., Open Government.

This dissertation focuses on two main challenges that face data exploration scenarios such as exploring open data found on the web. First, the infrastructure necessary for interactive data exploration is costly and hard to manage, especially by users who do not have technical knowledge. Second, the target users need guidance through the data exploration since there are too many starting points.

To eliminate challenges related to managing infrastructure, we propose an in-browser SQL engine (serverless), i.e., a portable database, which we call Afterburner. Afterburner achieves comparable performance to native SQL engines given the same resources on modestly sized data sets. Afterburner uses code generation techniques that target an optimization-amenable subset of JavaScript and employs typed arrays for its columnar in-memory storage. In addition, for databases that are too large for the browser, we propose a hybrid architecture to accelerate data exploration tasks: a one-time SQL query that runs at the backend and SQL queries that run in the browser in response to the user’s interactions. Based on a simple hint by the user, Afterburner automatically splits queries into two parts: a backend query that generates a materialized view that is shipped to the browser, and a frontend query per subsequent interaction that runs locally against this view. Optimizing queries using local materialized views inside the browser reduces query latency without adding any complexity to the backend or the frontend.

One common theme among many data exploration tasks revolves around navigating the many different ways to group the data, i.e., exploring the data cube. Thus, to guide the user through data exploration, we apply an information-theoretic technique, called Explanation Tables, that picks the most informative parts from the entire data cube of a relational table. We evaluate the efficiency and effectiveness of a sampling-based technique for generating explanation tables that achieves comparable quality to an exhaustive technique that considers the entire data cube, with a significant reduction in the run time. In addition, we introduce optimizations to explanation tables to fit the modest resources available in the browser without any external dependencies.

In this dissertation, we present an SQL engine and a data exploration guidance tool that run entirely in the browser. We view the techniques and the experiments presented here as a fully functional and open-sourced proof of viability of our proposal. Our analytical stack is portable and works entirely in the browser. We show that SQL and exploration guidance can be as accessible as a web page, which opens the opportunity for more people to analyze data sets. Facilitating data exploration for everyone is one step closer towards analytics democratization where everyone can participate in data exploration, not just the experts.
