学位论文详细信息
Physical Design for Non-relational Data Systems
data systems;physical design;NoSQL;normalization;Apache Cassandra;Apache Spark
Mior, Michaeladvisor:Salem, Kenneth ; affiliation1:Faculty of Mathematics ; Salem, Kenneth ;
University of Waterloo
关键词: physical design;    NoSQL;    normalization;    data systems;    Apache Cassandra;    Doctoral Thesis;    Apache Spark;   
Others  :  https://uwspace.uwaterloo.ca/bitstream/10012/13555/3/Mior_Michael.pdf
瑞士|英语
来源: UWSPACE Waterloo Institutional Repository
PDF
【 摘 要 】

Decades of research have gone into the optimization of physical designs, query execution, and related tools for relational databases. These techniques and tools make it possible for non-expert users to make effective use of relational database management systems. However, the drive for flexible data models and increased scalability has spawned a new generation of data management systems which largely eschew the relational model. These include systems such as NoSQL databases and distributed analytics frameworks such as Apache Spark which make use of a diverse set of data models. Optimization techniques and tools developed for relational data do not directly apply in this setting. This leaves developers making use of these systems with the need to become intimately familiar with system details to obtain good performance.We present techniques and tools for physical design for non-relational data systems. We explore two settings: NoSQL database systems and distributed analytics frameworks. While NoSQL databases often avoid explicit schema definitions, many choices on how to structure data remain. These choices can have a significant impact on application performance. The data structuring process normally requires expert knowledge of the underlying database. We present the NoSQL Schema Evaluator (NoSE). Given a target workload, NoSE provides an optimized physical design for NoSQL database applications which compares favourably to schemas designed by expert users. To enable existing applications to benefit from conceptual modeling, we also present an algorithm to recover a logical model from a denormalized database instance.Our second setting is distributed analytics frameworks such as Apache Spark. As is the case for NoSQL databases, expert knowledge of Spark is often required to construct efficient data pipelines. In NoSQL systems, a key challenge is how to structure stored data, while in Spark, a key challenge is how to cache intermediate results. We examine a particularly common scenario in Spark which involves performing iterative analysis on an input dataset. We show that jobs written in an intuitive manner using existing Spark APIs can have poor performance. We propose ReSpark, which automates caching decisions for iterative Spark analyses. Like NoSE, ReSpark makes it possible for non-expert users to obtain good performance from a non-relational data system.

【 预 览 】
附件列表
Files Size Format View
Physical Design for Non-relational Data Systems 1178KB PDF download
  文献评价指标  
  下载次数:12次 浏览次数:26次