As applications move toward peta- and exascale data sets, it has become increasingly important to develop more efficient data retrieval and storage mechanisms that reduce network traffic and server load while minimizing user-perceived retrieval delays. We propose an Intelligent Caching technique and a Graph Summarization technique to achieve low-latency data retrieval for big-data applications. Our caching approach is built on top of HDFS to optimize its read latency; HDFS is primarily suited to Write Once Read Many (WORM) applications, where reads significantly outnumber writes. In our Intelligent Caching approach, we analyze real-world MapReduce traces from Facebook and Yahoo in terms of file size and access-pattern distribution. We combine this with existing analyses from the literature to develop a new caching algorithm built on the recently released HDFS caching API. Based on the finding that a majority of accesses in a MapReduce cluster occur within the first two hours of file creation, our caching algorithm uses a sliding-window approach to ensure that the most popular files remain in cache at the appropriate times. It uses file characteristics within a particular window to determine a file's popularity, which is computed from file access patterns, file age, and workload characteristics. We evaluate our algorithm on various performance metrics using a simulator-based technique with both real-world and synthetic traces, and compare it against several existing LRU/LFU variants. The recent rapid growth of real-world social networks has incentivized researchers to explore optimizations that can provide quick insights about a network. Motivated by this, graph summarization and approximation have become important research problems. Most work in this area has focused on concise and informative representations of large graphs.
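The sliding-window popularity mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the evaluated algorithm: the class name, the scoring formula, and the weighting of access count against file age are all illustrative assumptions; only the window idea and the two-hour observation come from the text.

```python
import time
from collections import defaultdict

class SlidingWindowCache:
    """Illustrative sketch of a sliding-window popularity cache.

    Popularity combines recent access count and file age within the
    current window; the specific formula and weights here are
    assumptions, not the parameters of the actual algorithm.
    """

    def __init__(self, capacity, window_seconds=2 * 3600):
        self.capacity = capacity           # max number of cached files
        self.window = window_seconds       # e.g. 2 hours, per the trace analysis
        self.accesses = defaultdict(list)  # file path -> access timestamps
        self.created = {}                  # file path -> creation timestamp
        self.cache = set()                 # paths currently held in cache

    def record_access(self, path, now=None):
        now = now if now is not None else time.time()
        self.created.setdefault(path, now)
        self.accesses[path].append(now)
        self._refresh(now)

    def _popularity(self, path, now):
        # Keep only accesses that fall inside the current window.
        recent = [t for t in self.accesses[path] if now - t <= self.window]
        self.accesses[path] = recent
        age = now - self.created[path]
        # Younger, frequently accessed files score higher (illustrative weighting).
        return len(recent) / (1.0 + age / self.window)

    def _refresh(self, now):
        # Rank all known files by popularity and keep the top ones cached.
        ranked = sorted(self.accesses,
                        key=lambda p: self._popularity(p, now),
                        reverse=True)
        self.cache = set(ranked[:self.capacity])
```

In this sketch a burst of accesses shortly after creation keeps a file cached, while files whose accesses age out of the window are evicted, matching the observation that most accesses occur soon after file creation.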
These graphs have billions of nodes and edges and require a distributed storage and processing system for any operation on them. Our work focuses on task-based summarization of large graphs stored in a distributed fashion, answering queries that are computationally expensive on the original graph but tolerant of minor errors in the exact results; semantically, such queries provide the same information even with approximate answers. Our contribution is a distributed framework that answers queries probabilistically and highly efficiently using compact representations of the original graph, stored as summary graphs across a cluster of multiple nodes. These summary graphs are also optimized for space complexity and grow only in the number of attributes used to answer the query; a combination of these graphs can then answer complex queries extremely efficiently. Our results are promising and show that significant runtime gains can be achieved with our framework without sacrificing much accuracy. In fact, we observe a decreasing trend in error as the graph size increases.
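The core idea of answering queries from a compact attribute-based summary instead of the original graph can be illustrated with a small sketch. All names and the grouping scheme below are assumptions for illustration: nodes sharing an attribute value collapse into one supernode, superedges store edge counts, and a count query is then answered from the summary alone rather than by scanning the original edges.

```python
from collections import defaultdict

def build_summary(edges, attr):
    """Illustrative attribute-based summary graph.

    edges: list of (u, v) pairs from the original graph.
    attr:  dict mapping each node to its attribute value.
    Returns supernode sizes and superedge counts.
    """
    super_edges = defaultdict(int)  # (group_u, group_v) -> edge count
    group_size = defaultdict(int)   # group -> number of original nodes
    seen = set()
    for u, v in edges:
        for n in (u, v):
            if n not in seen:
                seen.add(n)
                group_size[attr[n]] += 1
        super_edges[(attr[u], attr[v])] += 1
    return dict(group_size), dict(super_edges)

def approx_edge_count(super_edges, g1, g2):
    # Answer "how many edges connect groups g1 and g2" from the summary
    # alone, without touching the original graph.
    count = super_edges.get((g1, g2), 0)
    if g1 != g2:
        count += super_edges.get((g2, g1), 0)
    return count
```

A simple count like this happens to be exact; for richer queries a summary of this kind trades a small, bounded error for the ability to answer without loading the billion-edge original, which is the trade-off the framework exploits.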