This dissertation describes work on the architecture of throughput-oriented accelerator processors. First, we examine the limitations of current accelerator processors and identify an opportunity to enable high throughput while also providing a more general-purpose programming model. To address this opportunity, we present Rigel, a single-chip accelerator architecture with 1024 independent processing cores targeted at a broad class of data- and task-parallel computation. We show that such an aggressive design, enabled by the feasibility of large die sizes combined with increasing transistor densities, can be implemented in today's process technology within acceptable area and power limits. We discuss our motivation for such a design and evaluate its performance scalability as well as its power and area requirements. We also describe the Rigel memory system, including the Task Centric Memory Model software coherence protocol, the Cohesion hybrid memory model, and lazy atomic operations.

We describe the Rigel toolflow, a set of tools we have developed for evaluating manycore accelerator architectures. The Rigel toolset includes an architectural simulator, an LLVM-based compiler, parallel benchmarks, RTL models, and associated infrastructure scripts and toolflows. We have prepared an open-source release of portions of the resulting toolset for the use of the broader research community. Such a release will enable others to perform further work in the area of accelerator design.

We present multi-level scheduling, a technique for throughput-oriented graphics processing units (GPUs) that is designed to reduce complexity and energy consumption. Modern GPUs employ a large number of hardware threads to hide both long and short latencies. Supporting tens of thousands of hardware threads requires a complex scheduler and a large register file that is expensive to access in terms of energy and latency. With multi-level scheduling, we divide threads into a smaller set of active threads for hiding short latencies and a larger set of pending threads for hiding long latencies to main memory. By reducing the number of concurrently active threads, we enable more efficient scheduler and register file structures.

Finally, we describe opportunities for applying similar hierarchical multithreading techniques to MIMD accelerator designs such as Rigel. We extend the original Rigel architecture with a new multithreaded microarchitecture. We propose a novel multithreading paradigm that gives the architect a flexible way to scale the number of threads to match the requirements of targeted workloads. We show that this new multithreaded architecture can be implemented efficiently.
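
The split between active and pending threads in multi-level scheduling can be illustrated with a small software model. The sketch below is only a toy under stated assumptions, not the hardware design evaluated in the dissertation: the class name `TwoLevelScheduler`, the round-robin issue order, the FIFO pending queue, and the `miss_latency` parameter are all hypothetical choices made for illustration.

```python
from collections import deque


class TwoLevelScheduler:
    """Toy model of two-level thread scheduling (illustrative assumptions only).

    A small active set is scheduled round-robin each cycle. A thread that
    starts a long-latency operation leaves the active set, and a pending
    thread takes its slot.
    """

    def __init__(self, thread_ids, active_size):
        ids = list(thread_ids)
        self.active_size = active_size
        self.active = deque(ids[:active_size])   # threads eligible to issue
        self.pending = deque(ids[active_size:])  # ready, waiting for an active slot
        self.stalled = {}                        # thread id -> cycle its operation completes

    def wake(self, cycle):
        """Move threads whose long-latency operation has finished back to pending."""
        done = [tid for tid, ready in self.stalled.items() if ready <= cycle]
        for tid in done:
            del self.stalled[tid]
            self.pending.append(tid)
        # Refill any free active slots from the pending queue.
        while len(self.active) < self.active_size and self.pending:
            self.active.append(self.pending.popleft())

    def step(self, cycle, miss_latency=0):
        """Issue from the next active thread and return its id (None if idle).

        If miss_latency > 0, the issuing thread is treated as having started a
        long-latency operation and is swapped out for a pending thread.
        """
        self.wake(cycle)
        if not self.active:
            return None
        tid = self.active.popleft()
        if miss_latency > 0:
            self.stalled[tid] = cycle + miss_latency
            if self.pending:
                self.active.append(self.pending.popleft())
        else:
            self.active.append(tid)  # stays active, goes to the back of the rotation
        return tid
```

In this model only `active_size` threads are ever examined per cycle, which mirrors the motivation stated above: the per-cycle scheduling decision covers a small set of threads, while the larger pending set exists to cover long latencies.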
Multithreaded architectures for manycore throughput processors