Thesis Details
Scalable and flexible bulk architecture
Keywords: Memory consistency model; Chunk-based Architecture; Cache Coherence
Qian, Xuehai
Others  :  https://www.ideals.illinois.edu/bitstream/handle/2142/45472/Xuehai_Qian.pdf?sequence=1&isAllowed=y
United States | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
【 Abstract 】

Multicore machines have become pervasive and, as a result, parallel programming has received renewed interest. Unfortunately, writing correct parallel programs is notoriously hard. Looking ahead, multicore designs should take into account support for programmability and productivity, and make it a first-class design consideration.

This thesis focuses on efficient and scalable architectural support to improve the programmability of shared-memory architectures. Specifically, we focus on supporting Sequential Consistency (SC), a strong and intuitive memory consistency model. The first part of the thesis focuses on enforcing SC by chunk-based execution. I propose techniques to remove the scalability bottlenecks of chunk-based architectures. Also, I propose the design of an SMT processor to support chunk operations among the contexts in the same processor. The second part of the thesis focuses on enforcing high-performance whole-system SC, from language to architecture, by speculative chunk ordering. The third part of the thesis focuses on dynamically and precisely detecting SC violations in a directory-based cache coherence protocol.

For chunk-based execution to be competitive, a machine must support chunk operations very efficiently. In my research, I focus on an environment with lazy conflict detection. In this environment, a major bottleneck in a large manycore with directory-based coherence is the chunk commit operation. The reason is that a chunk must appear to commit all of its writes atomically, even though the addresses written by the chunk belong to different, distributed directory modules. In addition, the commit may have to compete against other committing chunks that have accessed some of the same addresses, which prohibits concurrent commit. To resolve this commit bottleneck, I propose two scalable chunk-commit protocols.

The first protocol, called ScalableBulk, innovates with a set of primitives that enable a scalable coherence protocol designed for chunks.
Specifically, ScalableBulk is the first work that integrates signatures into the directory design. Signatures enable the concurrent commit of any number of chunks that use the same directory module, as long as their addresses do not overlap. In addition, ScalableBulk introduces a commit protocol that groups all the directories relevant to the chunk in a way that ensures: (i) multiple groups of directories with non-overlapping addresses can form successfully concurrently and (ii) if the directory groups have overlapping addresses, at least one of the groups forms.

The second protocol, called IntelliSense, targets two inefficiencies in ScalableBulk. First, a ScalableBulk commit grabs the relevant directory modules in a sequential manner to ensure deadlock freedom, which may incur long latency for large directory groups. Second, two chunks with cross-processor write-after-write (WAW) dependences between them cannot commit concurrently; one squashes the other, even though these are name dependences.

To solve the first problem, I propose the IntelliCommit protocol, where a commit grabs all the relevant directory modules in parallel. The idea is for the committing processor to send commit-request protocol messages to all of the relevant directory modules in parallel, get their responses directly, and finally send them a commit-confirm message.

To solve the second problem, I propose the IntelliSquash mechanism. It uses an idea similar to the store buffers in current processors to serialize, without any squash, the commits of two chunks that only have WAWs. The result is that the write sets of the two chunks are applied in a serial manner without squashes.

To support chunk-based execution in Simultaneous Multithreading (SMT) processors, I propose BulkSMT [59]. It exploits the close proximity of the contexts in an SMT processor to concurrently run dependent chunks with simple hardware. I perform a broad design space analysis.
The designs analyzed include three different schemes for conflict resolution inside the SMT processor. As a result of the analysis, I show for the first time that SMT processors are very cost-effective in supporting the concurrent execution of dependent chunks.

Chunk-based execution is effective at enforcing SC in hardware. However, since a memory model deals with the whole computing stack, its semantics are well-defined only when the model is specified and enforced consistently in every layer, from the language to the hardware. Therefore, to harness the benefits of SC, hardware-only SC enforcement is not sufficient: the software can easily violate SC even if the hardware implementation is correct. For correctness, we need to guarantee SC in every system layer, which is called whole-system SC.

To enable high-performance whole-system SC, I propose UniBlock, the first scheme built from a conventional distributed cache coherence protocol that prevents SC violations due to hardware and software with the same set of techniques. The basic concept in UniBlock is the ordered chunk, which is used by the hardware as the mechanism to enforce hardware SC, and by the compiler as the specification to guide the scope of compiler optimizations that could violate SC. Starting from a conventional relaxed-consistency coherence protocol, UniBlock forms intermittent dynamic chunks when the speculative retirement of an instruction may violate SC. The compiler also marks the optimized code regions as static chunks to ensure correct execution. UniBlock treats static and dynamic chunks in a unified manner, and cleanly supports whole-system SC.

The above techniques are used to enforce SC, and involve some changes in the cache coherence protocol. The last work of this thesis is to detect SC violations based on a conventional cache coherence protocol.
To address this problem, I propose Volition [60], the first scalable and precise hardware SC violation (SCV) detector that detects SCVs involving an arbitrary number of processors. Volition detects SCVs dynamically as a program runs. While it can be applied to both directory and bus-based coherence protocols, it does not rely on any property that is only available in a bus, such as broadcast. Volition’s idea is to dynamically detect data-dependence cycles across processors by piggybacking information on the coherence transactions. When an SCV is detected, an exception is raised to the software. For a given dynamic execution, Volition suffers no false positives or negatives.
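
The notion of a data-dependence cycle across processors can be illustrated in software. The sketch below is not Volition's hardware mechanism, which the abstract does not detail; it only assumes the standard view that an execution violates SC when program-order edges and observed inter-processor dependence edges together form a cycle. The function name `has_cycle`, the access labels, and the two example executions are hypothetical.

```python
from collections import defaultdict

def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
        graph[dst]  # touch the key so nodes without outgoing edges are registered

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: the path returned to itself
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))

# Hypothetical two-processor example (store-buffering pattern).
# Program order: P0 writes x then reads y; P1 writes y then reads x.
program_order = [
    (("P0", "W x"), ("P0", "R y")),
    (("P1", "W y"), ("P1", "R x")),
]

# Execution A: both reads return the old value, so each read is ordered
# before the other processor's write to the same location.
deps_a = [
    (("P0", "R y"), ("P1", "W y")),
    (("P1", "R x"), ("P0", "W x")),
]

# Execution B: P1's read of x observes P0's write, which is SC-compatible.
deps_b = [
    (("P0", "R y"), ("P1", "W y")),
    (("P0", "W x"), ("P1", "R x")),
]

scv_in_a = has_cycle(program_order + deps_a)
scv_in_b = has_cycle(program_order + deps_b)
```

In execution A the four accesses form a cycle, so no single interleaving explains the observed values; in execution B the edges are consistent with the order W x, R y, W y, R x.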

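The abstract's statement that signatures let chunks with non-overlapping addresses commit concurrently at one directory module can be sketched as a conservative set-overlap test. This is a minimal illustration, assuming a Bloom-filter-style signature; the class name `Signature`, the two 64-bit fields, the 64-byte line size, and the bit-slice hashing are all assumptions and are not taken from the thesis.

```python
class Signature:
    """Hypothetical fixed-size summary of the line addresses a chunk wrote."""
    FIELDS = 2           # assumed number of bit-fields
    BITS_PER_FIELD = 64  # assumed width of each field
    LINE_SHIFT = 6       # assumed 64-byte cache lines

    def __init__(self, addresses=()):
        self.fields = [0] * self.FIELDS
        for addr in addresses:
            self.insert(addr)

    def insert(self, addr):
        line = addr >> self.LINE_SHIFT
        for i in range(self.FIELDS):
            # Each field is indexed by a different slice of the line address.
            index = (line >> (6 * i)) % self.BITS_PER_FIELD
            self.fields[i] |= 1 << index

    def may_overlap(self, other):
        # A shared address sets the same bit in every field, so an empty
        # AND in any one field proves the two address sets are disjoint.
        return all(a & b for a, b in zip(self.fields, other.fields))


def can_commit_concurrently(sig_a, sig_b):
    """Two chunks may commit together only if their signatures are disjoint."""
    return not sig_a.may_overlap(sig_b)


chunk_a = Signature([0x1000, 0x1040])
chunk_b = Signature([0x2000])
chunk_c = Signature([0x1040, 0x3000])   # shares line 0x1040 with chunk_a
aliased = Signature([0x41000])          # different line, same bits as 0x1000

a_with_b = can_commit_concurrently(chunk_a, chunk_b)
a_with_c = can_commit_concurrently(chunk_a, chunk_c)
a_with_alias = can_commit_concurrently(chunk_a, aliased)
```

The check never misses a real overlap, but aliasing can report an overlap that does not exist (as with `aliased` here), which costs concurrency rather than correctness.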