IEEE Access
Spatial Data Dependence Graph Based Pre-RTL Simulator for Convolutional Neural Network Dataflows
Jooho Wang [1], Chester Sungchung Park [1], Sungkyung Park [2]
[1] Department of Electrical and Electronics Engineering, Konkuk University, Seoul, South Korea; [2] Department of Electronics Engineering, Pusan National University, Pusan, South Korea
Keywords: Convolutional neural networks (CNNs); data dependence graph; design space exploration (DSE); hardware accelerators; latency-insensitive controller; pre-RTL simulator
DOI: 10.1109/ACCESS.2022.3146413
Source: DOAJ
Abstract
In this paper, a new pre-RTL simulator is proposed to predict the power, performance, and area of convolutional neural network (CNN) dataflows prior to register-transfer-level (RTL) design. In the simulator, a novel approach is adopted to implement a spatial data dependence graph (SDDG), which enables us to model a specific dataflow along with its inter-instruction dependencies by tracking the status of each processing element (PE). In addition, the proposed pre-RTL simulator makes it possible to evaluate the impact of memory constraints such as latency and bandwidth. The latency-insensitive and bandwidth-insensitive PE controllers assumed in the proposed pre-RTL simulator guarantee both functional correctness and maximum performance, regardless of memory constraints. In particular, it is shown that an optimal distribution of local memory bandwidth can reduce the accelerator execution time by up to 37.6% compared with an equal distribution. For weight-stationary (WS) and row-stationary (RS) dataflows, accelerator performance depends closely on memory constraints. The simulation results also show that the relative performance of the dataflows depends on the shape of the convolutional layer. For example, for an identical hardware area in a standard convolutional layer of AlexNet, WS dataflows provide no performance gain over RS dataflows when the memory latency is sufficiently high. Moreover, since the number of weights loaded at any given time is limited, WS dataflows cannot fully reuse the input activations, which increases local memory accesses. In a depth-wise convolutional layer of MobileNet, by contrast, WS dataflows tend to outperform RS dataflows even in the presence of large memory latency. The source code is available on the GitHub repository:
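The core idea of tracking each PE's status and issuing an instruction only after its dependencies retire can be illustrated with a minimal sketch. This is a hypothetical Python model, not the authors' simulator: the `Instr` record, the single-in-flight-instruction-per-PE rule, and the per-instruction `latency` field (a stand-in for memory latency effects) are all simplifying assumptions made here for illustration.

```python
# Hypothetical sketch of dependence-graph-driven PE scheduling,
# loosely inspired by the SDDG idea in the abstract (not the paper's code).
from dataclasses import dataclass, field

@dataclass
class Instr:
    pe: tuple                                  # (row, col) of the executing PE
    latency: int                               # cycles until the result retires
    deps: list = field(default_factory=list)   # indices of prerequisite instrs
    done_at: int = -1                          # retirement cycle (-1 = not issued)

def simulate(instrs):
    """Advance cycle by cycle; issue an instruction when all of its
    dependencies have retired and its PE is idle. Returns total cycles."""
    pe_free = {}                      # PE -> cycle at which it becomes idle again
    pending = list(range(len(instrs)))
    cycle = 0
    while pending:
        issued = []
        for i in pending:
            ins = instrs[i]
            deps_ok = all(0 <= instrs[d].done_at <= cycle for d in ins.deps)
            if deps_ok and pe_free.get(ins.pe, 0) <= cycle:
                ins.done_at = cycle + ins.latency
                pe_free[ins.pe] = ins.done_at
                issued.append(i)
        for i in issued:
            pending.remove(i)
        cycle += 1
    return max(ins.done_at for ins in instrs)
```

In this toy model, raising `latency` on instructions that model memory loads lengthens the schedule only along dependence chains, which is the kind of dataflow-versus-memory-constraint interaction the simulator is said to expose.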
License
Unknown