2017 2nd International Seminar on Advances in Materials Science and Engineering
A strategy to load balancing for non-connectivity MapReduce job
Zhou, Huaping^1 ; Liu, Guangzong^1 ; Gui, Haixia^1
College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan, Anhui 232000, China^1
Keywords: Complex datasets; Data distribution; Data partitioning; Data skew; Distributed programming model; Map-reduce; Real data sets; Sampling method
Others: https://iopscience.iop.org/article/10.1088/1757-899X/231/1/012038/pdf DOI: 10.1088/1757-899X/231/1/012038
Source: IOP
【 Abstract 】
MapReduce is a distributed programming model widely used to process large-scale, complex datasets. Its default hash partitioning function often causes data skew when the data distribution is uneven. To address this imbalance in data partitioning, we propose a strategy that changes the partition index of the remaining data when skew is detected. In the Map phase, we count the amount of data that will be distributed to each reducer. The JobTracker then monitors the global partitioning information and dynamically modifies the original partitioning function according to the data skew model, so that in the next partitioning pass the Partitioner redirects partitions that would cause skew to other reducers with lighter load, eventually balancing the load across nodes. Finally, we experimentally compare our method with existing methods on both synthetic and real datasets. The results show that, for non-connectivity MapReduce tasks, our strategy resolves data skew with better stability and efficiency than the Hash and Sampling methods.
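The idea in the abstract can be illustrated with a small single-process sketch. It is not the authors' implementation: the abstract does not spell out the data skew model, so the threshold test below (`skew_factor` times the mean load) is an assumption, as are the names `hash_partition`, `RebalancingPartitioner`, and `hash_loads`. The per-reducer counter stands in for the global partitioning information that the JobTracker would gather from the mappers. Because a key may be sent to more than one reducer, the sketch only makes sense for jobs that do not require all records of a key to meet at a single reducer.

```python
from collections import Counter
import zlib


def hash_partition(key: str, num_reducers: int) -> int:
    """Deterministic stand-in for a default hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


class RebalancingPartitioner:
    """Redirects records away from a reducer once it looks overloaded.

    Assumed skew test (hypothetical, not from the paper): a reducer is
    skewed when its load exceeds skew_factor times the current mean load.
    """

    def __init__(self, num_reducers: int, skew_factor: float = 1.5):
        self.num_reducers = num_reducers
        self.skew_factor = skew_factor
        # Records assigned to each reducer so far (the "global" view).
        self.load = [0] * num_reducers

    def partition(self, key: str) -> int:
        idx = hash_partition(key, self.num_reducers)
        mean = sum(self.load) / self.num_reducers
        if self.load[idx] > self.skew_factor * max(mean, 1.0):
            # Change the partition index to the least-loaded reducer.
            idx = min(range(self.num_reducers), key=self.load.__getitem__)
        self.load[idx] += 1
        return idx


def hash_loads(keys, num_reducers: int):
    """Per-reducer record counts under plain hash partitioning."""
    counts = Counter(hash_partition(k, num_reducers) for k in keys)
    return [counts.get(i, 0) for i in range(num_reducers)]
```

Feeding a skewed key stream (for example, one key that accounts for 90% of the records) through `hash_loads` and through `RebalancingPartitioner.partition` shows the difference: plain hashing puts every record of the hot key on one reducer, while the rebalancing variant spills the excess onto the lighter reducers.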