Dissertation Details
Data layout transformation through in-place transposition
Sung, I-Jui
Keywords: Transposition; In-place; Graphics Processing Unit (GPU); Data layout; Compilers
URL: https://www.ideals.illinois.edu/bitstream/handle/2142/44300/I-Jui_Sung.pdf?sequence=1&isAllowed=y
United States | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
【 Abstract 】
Matrix transposition is an important algorithmic building block for many numeric algorithms such as the multidimensional FFT. It has also been used to convert the storage layout of arrays. Intuitively, in-place transposition should be a good fit for GPU architectures due to their limited on-board memory capacity and high throughput. However, direct application of in-place transposition algorithms from the CPU lacks the parallelism and locality required by GPUs to achieve good performance. In this thesis we present the first known in-place matrix transposition approach for GPUs. Our implementation is based on a staged transposition algorithm in which each stage is performed using an elementary tile-wise transposition. With both low-level optimizations to the elementary tile-wise transpositions and high-level improvements to the existing staged transposition algorithm, our design reaches more than 20 GB/s sustained throughput on modern GPUs, and a 3X speedup.

Furthermore, for many-core architectures like GPUs, efficient off-chip memory access is crucial to high performance; applications are often limited by off-chip memory bandwidth. Transforming data layout is an effective way to reshape access patterns and improve off-chip memory access behavior, but several challenges have limited the use of automated data layout transformation systems on GPUs, namely how to efficiently handle arrays of aggregates, and how to transparently marshal data between the layouts required by different performance-sensitive kernels and legacy host code.
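For context, the classical sequential baseline the thesis improves on can be sketched as cycle-following in-place transposition: in a row-major m x n matrix stored flat, the element at index i belongs at (i * m) mod (mn - 1) after the transpose. A minimal CPU sketch (illustrative only; the staged tile-wise GPU algorithm of the thesis is not reproduced here):

```python
def transpose_in_place(a, m, n):
    """Transpose an m x n row-major matrix stored flat in list `a`, in place.

    Each element at flat index i (0 < i < m*n - 1) moves to
    (i * m) % (m*n - 1); we follow the resulting permutation cycles.
    The O(mn) visited bitmap is for clarity -- avoiding such bookkeeping
    while exposing GPU-scale parallelism is what makes the problem hard.
    """
    size = m * n
    if size < 3:
        return a                       # 1x1 and 1x2 are already transposed
    visited = [False] * size
    for start in range(1, size - 1):   # first and last elements never move
        if visited[start]:
            continue
        i, val = start, a[start]
        while True:
            j = (i * m) % (size - 1)   # destination of the element at i
            a[j], val = val, a[j]      # drop val off, pick up displaced one
            visited[j] = True
            i = j
            if i == start:
                break
    return a
```

For example, transposing the 2x3 matrix `[0, 1, 2, 3, 4, 5]` in place yields `[0, 3, 1, 4, 2, 5]`, the flat form of the 3x2 transpose. The cycle structure of this permutation is data-independent but irregular, which is why a naive port to the GPU lacks the parallelism and locality noted above.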
While GPUs have higher memory bandwidth and are natural candidates for marshaling data between layouts, the relatively constrained GPU memory capacity, compared to that of the CPU, implies that not only the temporal cost of marshaling but also the spatial overhead must be considered in any practical layout transformation system. First, as an application of the in-place transposition methodology, a novel approach to laying out arrays of aggregate types across GPU and CPU architectures is proposed to further improve memory parallelism and kernel performance beyond what is achieved by human programmers using discrete arrays today. Second, the system, DL, has a run-time library implemented in OpenCL that transparently and efficiently converts, or marshals, data to accommodate application components that have different data layout requirements. We present insights that lead to the design of this highly efficient run-time marshaling library. Third, we show experimental results that the new layout approach leads to substantial performance improvement at the application level even when all marshaling cost is taken into account.
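The marshaling problem can be made concrete: converting an array of aggregates (array-of-structures, AoS) into discrete per-field arrays (structure-of-arrays, SoA) is, on the flat storage, exactly an n_records x n_fields matrix transposition, which is why in-place transposition machinery applies. A hedged out-of-place reference sketch (illustrative only; DL's OpenCL run-time performs the conversion in place on the GPU):

```python
def marshal_aos_to_soa(flat, n_records, n_fields):
    """Reference (out-of-place) marshaling from AoS to SoA.

    Field f of record r lives at index r*n_fields + f in AoS and at
    f*n_records + r in SoA -- the index swap of a transpose, so an
    in-place transposition does the same job without this extra buffer.
    """
    out = [None] * (n_records * n_fields)
    for r in range(n_records):
        for f in range(n_fields):
            out[f * n_records + r] = flat[r * n_fields + f]
    return out

# Three records of a struct {x, y}, stored AoS:
aos = ["x0", "y0", "x1", "y1", "x2", "y2"]
# After marshaling, each field is contiguous, so adjacent GPU threads
# reading the same field of consecutive records touch consecutive
# addresses (coalesced off-chip access).
```

The spatial-overhead concern raised above is visible here: the out-of-place version needs a second full-size buffer, which the in-place formulation avoids on the memory-constrained GPU.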