Data Science and Engineering
Scaling Word2Vec on Big Corpus
Affiliations: Tokyo Institute of Technology, Tokyo, Japan; AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory, Tokyo, Japan; RIKEN Center for Computational Science, Kobe, Japan; Renmin University of China, Beijing, China
Keywords: Machine learning; Natural language processing; High performance computing; Word embeddings
DOI: 10.1007/s41019-019-0096-6
Source: publisher
【 Abstract 】
Word embedding has been well accepted as an important feature in the area of natural language processing (NLP). Specifically, the Word2Vec model learns high-quality word embeddings and is widely used in various NLP tasks. The training of Word2Vec is sequential on a CPU due to strong dependencies between word–context pairs. In this paper, we aim to scale Word2Vec on a GPU cluster. To do so, the main challenge is reducing dependencies inside a large training batch. We heuristically design a variation of Word2Vec that ensures each word–context pair contains a non-dependent word and a uniformly sampled contextual word. During batch training, we “freeze” the context part and update only the non-dependent part to reduce conflicts. This variation also directly controls the number of training iterations by fixing the number of samples, and it treats high-frequency and low-frequency words equally. We conduct extensive experiments over a range of NLP tasks. The results show that our proposed model achieves a 7.5-fold acceleration on 16 GPUs with no drop in accuracy. Moreover, by using the high-level Chainer deep learning framework, we can easily implement Word2Vec variations such as CNN-based subword-level models and achieve similar scaling results.
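To illustrate the batching idea described in the abstract, the following is a minimal sketch (not the authors' released code, and written in plain NumPy rather than Chainer for brevity): each sample pairs a non-dependent word with one sampled contextual word, the context-side vectors are held fixed during the batch, and only the word-side vectors receive accumulated updates. All names and hyperparameters (VOCAB, DIM, lr, the number of negatives) are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of batched skip-gram with negative sampling where the
# context ("frozen") side is not updated within a batch; only the
# non-dependent word side accumulates gradients. Hyperparameters are assumed.
VOCAB, DIM, lr = 10_000, 100, 0.025
W_in = np.random.uniform(-0.5 / DIM, 0.5 / DIM, (VOCAB, DIM))  # word vectors (updated)
W_ctx = np.zeros((VOCAB, DIM))                                  # context vectors (frozen per batch)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_batch(words, contexts, negatives):
    """words, contexts: (B,) int arrays; negatives: (B, K) int array."""
    grad = np.zeros_like(W_in)
    for w, c, negs in zip(words, contexts, negatives):
        for target, label in [(c, 1.0)] + [(n, 0.0) for n in negs]:
            score = sigmoid(W_in[w] @ W_ctx[target])
            # Gradient only for the non-dependent (word) side; the context
            # side stays fixed until the batch has been processed.
            grad[w] += lr * (label - score) * W_ctx[target]
    W_in += grad  # apply the accumulated updates once per batch

# Example: one toy batch of two samples with three negatives each.
words = np.array([1, 2])
contexts = np.array([5, 7])
negatives = np.random.randint(0, VOCAB, size=(2, 3))
train_batch(words, contexts, negatives)
```

Because the frozen context side removes read-after-write conflicts inside a batch, the per-sample updates can be computed in parallel on GPUs, which is the property the paper exploits for scaling.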
【 License 】
CC BY
【 Preview 】
| Files | Size | Format | View |
|---|---|---|---|
| RO201910108938975ZK.pdf | 2288 KB | PDF | download |