Journal article details
Defence Science Journal
Solution of Large Sparse System of Linear Equations over GF(2) on a Multi Node Multi GPU Platform
article
Shruti Rawal [1], Indivar Gupta [1]
[1] DRDO - Scientific Analysis Group
Keywords: GF(2); GPGPU computing; MPI; CUDA; Block Wiedemann Algorithm; NVidia V100 GPU; NVidia DGX station; HPC cluster
DOI: 10.14429/dsj.72.17656
Subject classification: Social sciences, humanities and arts (general)
Source: Defence Scientific Information & Documentation Centre
【 Abstract 】

We provide an efficient multi-node, multi-GPU implementation of the Block Wiedemann Algorithm (BWA) to solve a large sparse system of linear equations over GF(2). An important application of solving such systems arises in integer factorization algorithms such as the Number Field Sieve. In this paper, we describe how hybrid parallelization can be adapted to speed up the most time-consuming stage of BWA, the sequence generation stage. This stage involves generating a sequence of matrix-matrix products and matrix transpose-matrix products in which the matrices are very large, highly sparse, and have entries over GF(2). We describe a GPU-accelerated parallel method for computing these matrix-matrix products, using row-wise distribution of the first matrix over a multi-node, multi-GPU platform with MPI and CUDA, and word-wise XORing of rows of the second matrix. We also describe the hybrid parallelization of the matrix transpose-matrix product computation, where we divide both matrices row-wise into equal-sized blocks using MPI. After a GPU-accelerated matrix transpose-matrix product is computed for each block, we combine the partial results using the MPI_BXOR operation in MPI_Reduce to obtain the final result. The performance of hybrid parallelization of the sequence generation step on a hybrid cluster using multiple GPUs is compared with parallelization on multiple MPI processors only. We have used this hybrid parallel sequence generation tool for benchmarking an HPC cluster. Detailed timings for the complete solution of the number field sieve matrices of RSA-130, RSA-140, and RSA-170 are also compared, using up to 4 NVidia V100 GPUs of a DGX station. We obtained a speedup of 2.8 on 4 V100 GPUs compared with 1 GPU.
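
The two kernels described in the abstract can be illustrated with a small CPU-only sketch. It is not the authors' MPI/CUDA implementation: it assumes each row of the dense block is packed into one machine word (a Python int here), stores the sparse matrix as lists of column indices, and simulates the MPI ranks with plain list chunks. The names `spmm_gf2`, `transpose_product_gf2`, `split_rows`, and `xor_reduce` are hypothetical and chosen only for this illustration; `xor_reduce` stands in for the role that MPI_BXOR plays inside MPI_Reduce.

```python
from functools import reduce


def spmm_gf2(sparse_rows, v_words):
    """Sparse GF(2) matrix times a bit-packed dense block.

    sparse_rows[i] lists the column indices of the 1-entries in row i.
    v_words[j] is row j of the dense block, packed into one integer.
    Output row i is the XOR of the dense rows selected by row i.
    """
    out = []
    for cols in sparse_rows:
        acc = 0
        for j in cols:
            acc ^= v_words[j]  # word-wise XOR = addition over GF(2)
        out.append(acc)
    return out


def transpose_product_gf2(x_words, y_words, width):
    """Compute X^T * Y over GF(2) for bit-packed X and Y.

    Row k of the result is the XOR of y_words[i] over all i where
    bit k of x_words[i] is set. `width` is the bit width of X's rows.
    """
    result = [0] * width
    mask = (1 << width) - 1
    for x, y in zip(x_words, y_words):
        x &= mask
        k = 0
        while x:
            if x & 1:
                result[k] ^= y
            x >>= 1
            k += 1
    return result


def split_rows(items, parts):
    """Split a list into contiguous, near-equal row blocks (one per simulated rank)."""
    size = -(-len(items) // parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def xor_reduce(partials):
    """Element-wise XOR of partial results (stand-in for MPI_Reduce with MPI_BXOR)."""
    return [reduce(lambda a, b: a ^ b, column) for column in zip(*partials)]


# Toy data: a 4x4 sparse matrix and a 4-row block with 4-bit rows.
B_rows = [[0, 2], [1], [0, 1, 3], [2, 3]]
V = [0b1010, 0b0110, 0b0011, 0b1111]

# Matrix-matrix product: each simulated rank owns a slice of B's rows
# and reads the whole of V; the slices are concatenated in rank order.
W_full = spmm_gf2(B_rows, V)
W_blocked = [w for chunk in split_rows(B_rows, 2) for w in spmm_gf2(chunk, V)]

# Transpose product: both operands are split row-wise into matching
# blocks, each block yields a partial result, and the partials are XORed.
T_full = transpose_product_gf2(V, W_full, 4)
T_blocked = xor_reduce([
    transpose_product_gf2(xb, yb, 4)
    for xb, yb in zip(split_rows(V, 2), split_rows(W_full, 2))
])
```

In this sketch the blocked and unblocked results agree because XOR is associative and commutative, which is the property that makes a bitwise-XOR reduction a valid way to combine per-block transpose products.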

【 License 】

All rights reserved
