Journal Article Details
ETRI Journal
Exploiting Thread-Level Parallelism in Lockstep Execution by Partially Duplicating a Single Pipeline
Keywords: MLEP; CMP; SMT; TLP; ILP
DOI  :  10.4218/etrij.08.0107.0343
【 Abstract 】

In most parallel loops of embedded applications, every iteration executes exactly the same sequence of instructions while manipulating different data. This fact motivates a new compiler-hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write-back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called the multithreaded lockstep execution processor (MLEP), is a compromise between the single-instruction multiple-data (SIMD) and simultaneous multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and it can save more power and chip area than the SMT/CMP approach without significant performance degradation. To verify the architecture, we extend a commercial 32-bit embedded core, the AE32000C, and synthesize it on a Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2-way MLEP and 33.7% faster with a 4-way MLEP on EEMBC benchmarks that are automatically parallelized by the Intel compiler.
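The lockstep idea described in the abstract can be illustrated with a small software model. The sketch below is not the MLEP hardware or the AE32000C instruction set: the three-operand mini-ISA, the `Lane` class, and the `run_lockstep` function are hypothetical names introduced only for illustration. It assumes a loop body `c[i] = a[i] + b[i]` whose iterations are independent, fetches and decodes each instruction once per group of threads, and lets every thread execute that instruction on its own registers and its own iteration index.

```python
from dataclasses import dataclass, field

# Hypothetical mini-ISA for the loop body c[i] = a[i] + b[i].
# These opcodes are illustrative only, not AE32000C instructions.
PROGRAM = [
    ("load", "r1", "a"),         # r1 = a[i]
    ("load", "r2", "b"),         # r2 = b[i]
    ("add", "r3", "r1", "r2"),   # r3 = r1 + r2
    ("store", "c", "r3"),        # c[i] = r3
]


@dataclass
class Lane:
    """Per-thread back end: private registers and a private iteration index."""
    index: int
    regs: dict = field(default_factory=dict)


def execute(lane, instr, mem):
    """Execute one already-decoded instruction on a single lane."""
    op = instr[0]
    if op == "load":
        _, dst, array = instr
        lane.regs[dst] = mem[array][lane.index]
    elif op == "add":
        _, dst, x, y = instr
        lane.regs[dst] = lane.regs[x] + lane.regs[y]
    elif op == "store":
        _, array, src = instr
        mem[array][lane.index] = lane.regs[src]
    else:
        raise ValueError(f"unknown opcode: {op}")


def run_lockstep(mem, n_iters, ways):
    """Run the loop with `ways` lanes in lockstep; return the shared fetch count."""
    fetches = 0
    for base in range(0, n_iters, ways):
        # Each lane in the group takes one iteration of the parallel loop.
        lanes = [Lane(i) for i in range(base, min(base + ways, n_iters))]
        for instr in PROGRAM:
            fetches += 1              # one fetch/decode shared by the whole group
            for lane in lanes:        # every lane executes the same instruction
                execute(lane, instr, mem)
    return fetches
```

In this toy model, an 8-iteration loop with the 4-instruction body needs 32 fetches with `ways=1` and 8 fetches with `ways=4`, while producing the same contents in `c`. The fetch count only illustrates the front-end sharing; it says nothing about the speedups reported in the abstract, which depend on the real pipeline and benchmarks.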
