Journal Article Details
IEEE Access
Row-Wise Product-Based Sparse Matrix Multiplication Hardware Accelerator With Optimal Load Balancing
Arslan Munir [1], Jong Hun Lee [2], Beomjin Park [3], Joonho Kong [4]
[1] Department of Computer Science, Kansas State University, Manhattan, KS, USA; LX Semicon, Seoul, South Korea; Samsung Electronics, Hwaseong, South Korea; School of Electronic and Electrical Engineering, Kyungpook National University, Daegu, South Korea
Keywords: Sparse matrix multiplication; row-wise product; load balancing; matrix tiling; speedup
DOI: 10.1109/ACCESS.2022.3184116
Source: DOAJ
[Abstract]

Matrix multiplication is a core computation kernel of emerging workloads such as deep neural networks and graph analytics. These workloads often exhibit high data sparsity, meaning that a large portion of the data elements are zero. Although systolic arrays have shown significant performance and energy-efficiency improvements over central processing units (CPUs) and graphics processing units (GPUs) when executing matrix multiplications, conventional systolic arrays largely overlook data sparsity. In this paper, we propose a row-wise product-based sparse matrix multiplication (SpMM) hardware accelerator for compressed sparse row (CSR)-formatted input matrices. Our hardware accelerator leverages the row-wise product, which has advantages over the inner product and the outer product when executing sparse matrix multiplications. Compared to conventional systolic arrays, which cannot skip ineffectual operations, our hardware accelerator performs only effectual operations on non-zero elements, improving SpMM performance. In addition, we propose an optimal load balancing scheme for multiple processing elements (PEs). The scheme uses operation count-based matrix tiling to enable parallel execution across the PEs and to avoid resource contention. According to our evaluation, our 32PE-SpMM accelerator achieves an average speedup of $13.6\times$ to $47.9\times$ over tensor processing unit (TPU)-like systolic arrays. Furthermore, our operation count-based load balancing scheme outperforms fixed tiling and non-zero element count-based tiling by up to 8.48% and 6.28%, respectively, with a matrix tiling pre-processing latency overhead of at most 0.06%.
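
As an illustration of the two ideas named in the abstract, the following is a minimal software sketch in Python. It is not the paper's hardware design. The first function forms each output row by scaling and accumulating rows of B selected by the non-zeros of A (the row-wise product) on CSR inputs. The other two show one plausible way to count multiply operations per row and to split rows into contiguous tiles of roughly equal operation count. The function names, the `(indptr, indices, data)` tuple layout, and the nearest-prefix-sum cut heuristic are assumptions made for this example; the abstract does not specify the paper's tiling algorithm.

```python
from itertools import accumulate

def csr_row_wise_spmm(a, b):
    """Compute C = A x B by row-wise product; A, B, C are (indptr, indices, data)."""
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(len(a_ptr) - 1):
        acc = {}  # sparse accumulator for row i of C
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_idx[p], a_val[p]
            # Only non-zeros of row k of B are touched: no ineffectual multiplies.
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[q]
                acc[j] = acc.get(j, 0) + a_ik * b_val[q]
        for j in sorted(acc):
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val

def row_op_counts(a, b):
    """Multiplications needed per row of A: sum of nnz(B[k, :]) over non-zeros A[i, k]."""
    a_ptr, a_idx, _ = a
    b_ptr = b[0]
    return [
        sum(b_ptr[k + 1] - b_ptr[k] for k in a_idx[a_ptr[i]:a_ptr[i + 1]])
        for i in range(len(a_ptr) - 1)
    ]

def tile_rows_by_ops(ops, num_pes):
    """Assumed heuristic: cut contiguous row tiles where the running operation
    count is closest to each equal share of the total. Returns tile boundaries."""
    prefix = [0] + list(accumulate(ops))
    total = prefix[-1]
    bounds = [0]
    for t in range(1, num_pes):
        target = total * t / num_pes
        cut = min(range(bounds[-1], len(prefix)),
                  key=lambda r: abs(prefix[r] - target))
        bounds.append(cut)
    bounds.append(len(ops))
    return bounds

# A = [[1,0,2],[0,0,3],[4,5,0]] and B = [[0,1,0],[2,0,0],[0,3,4]] in CSR form.
A = ([0, 2, 3, 5], [0, 2, 2, 0, 1], [1, 2, 3, 4, 5])
B = ([0, 1, 2, 4], [1, 0, 1, 2], [1, 2, 3, 4])
C = csr_row_wise_spmm(A, B)
ops = row_op_counts(A, B)            # [3, 2, 2]
tiles = tile_rows_by_ops(ops, 2)     # [0, 1, 3]: row 0 to one PE, rows 1-2 to the other
```

In this small case, tiling by non-zero count of A alone would see rows with 2, 1, and 2 non-zeros, whereas the operation counts are 3, 2, and 2, because the work per non-zero depends on how dense the matching row of B is. That difference is the motivation for counting operations rather than non-zeros.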

[License]

Unknown   
