7th Workshop on Large-Scale Distributed Systems for Information Retrieval
The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce
Jimmy Lin
URL: http://CEUR-WS.org/Vol-480/paper3.pdf | PID: 11475
Source: CEUR
【 Abstract 】
This paper explores the problem of “stragglers” in MapReduce: a common phenomenon where a small number of mappers or reducers take significantly longer than the others to complete. The effects of these stragglers include unnecessarily long wall-clock running times and sub-optimal cluster utilization. In many cases, this problem cannot simply be attributed to hardware idiosyncrasies, but is rather caused by the Zipfian distribution of input or intermediate data. I present a simple theoretical model that shows how such distributions impose a fundamental limit on the amount of parallelism that can be extracted from a large class of algorithms where all occurrences of the same element must be processed together. A case study in parallel ad hoc query evaluation highlights the severity of the stragglers problem. Fortunately, a simple modification of the input data cuts end-to-end running time in half. This example illustrates some of the issues associated with designing efficient MapReduce algorithms for real-world datasets.
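To make the claimed limit concrete, here is a minimal simulation sketch (not from the paper; the constants NUM_KEYS, NUM_TOKENS, and NUM_REDUCERS are illustrative assumptions) of reducer load skew when intermediate keys follow a Zipf distribution and all records sharing a key must be processed by the same reducer:

```python
import random
from collections import Counter

NUM_KEYS = 10_000       # distinct intermediate keys (e.g., terms)
NUM_TOKENS = 1_000_000  # total intermediate records
NUM_REDUCERS = 100

# Zipfian popularity: the rank-r key occurs with weight proportional to 1/r.
weights = [1.0 / r for r in range(1, NUM_KEYS + 1)]
keys = random.choices(range(NUM_KEYS), weights=weights, k=NUM_TOKENS)

# Hash-partition keys across reducers (as a default MapReduce partitioner
# would); every record for a given key lands on the same reducer.
loads = Counter(hash(k) % NUM_REDUCERS for k in keys)

print(f"average reducer load: {NUM_TOKENS / NUM_REDUCERS:.0f}")
print(f"maximum reducer load: {max(loads.values())}")
# Wall-clock time tracks the maximum load, not the average: the reducer
# that receives the rank-1 key alone handles roughly NUM_TOKENS / H_n
# records (H_n = n-th harmonic number, about 9.8 for n = 10,000),
# no matter how many reducers are configured.
```

Under these assumptions the maximum reducer load comes out roughly ten times the average, so adding reducers past a point cannot shorten the job; this mirrors the kind of parallelism bound the abstract describes.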
【 Preview 】
Files | Size | Format | View
---|---|---|---
The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce | 222KB | PDF | download