Thesis

【Abstract】

Companies such as Facebook, Amazon, and Twitter generate very large amounts of data, and cloud computing frameworks such as Hadoop are used to store and process this Big Data. Hadoop, an open-source cloud computing framework, is popular because of its scalability and fault tolerance. However, because it frequently writes data to and reads data from the Hadoop Distributed File System (HDFS), Hadoop is quite slow in many applications. Apache Spark, a newer cloud computing framework developed at the AMPLab at UC Berkeley, addresses this problem by caching data in memory. Spark introduces an abstraction called the resilient distributed dataset (RDD), which is both scalable and fault-tolerant. In this thesis, we describe the architectures of Hadoop and Spark and discuss their differences. We discuss in detail the properties of RDDs and how they work in Spark, which provides a guide to using them efficiently. The main contribution of the thesis is an implementation of the PageRank algorithm and the Conjugate Gradient (CG) method in both Hadoop and Spark, showing how Spark outperforms Hadoop by taking advantage of memory caching.

【Preview】

Attachments
Files	Size	Format	View
Implementations of iterative algorithms in Hadoop and Spark	2906KB	PDF	download


Implementations of iterative algorithms in Hadoop and Spark
Lai, Junyu
University of Waterloo
Keywords: Hadoop; Spark; Resilient Distributed Datasets; Conjugate Gradient method; Applied Mathematics
Others : https://uwspace.uwaterloo.ca/bitstream/10012/8586/1/Lai_Junyu.pdf
Switzerland | English
Source: UWSPACE Waterloo Institutional Repository


