Journal Article Details
Journal: Information
Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages
Xiao Li [1]; Yang Yuan [2]; Ya-Ting Yang [2]
[1] Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
Keywords: word embedding; word alignment probability; distance attenuation function; word2vec; GloVe
DOI: 10.3390/info11010024
Source: DOAJ
Abstract

To overcome the data sparseness problem when training word embeddings for low-resource languages, we propose a punctuation and parallel corpus based word embedding model. In particular, we generate the global word-pair co-occurrence matrix with a punctuation-based distance attenuation function, and integrate it with the intermediate word vectors generated from a small-scale bilingual parallel corpus to train the word embeddings. Experimental results show that, compared with widely used baseline models such as GloVe and Word2vec, our model significantly improves word embedding quality for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improves accuracy on the word analogy task by 0.71 percentage points and achieves the best results on all of the word similarity tasks.
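The punctuation-based distance attenuation described in the abstract can be illustrated with a minimal sketch in Python. It assumes, purely for illustration, that each word pair contributes the usual GloVe-style 1/d distance weight and that this weight is further multiplied by a decay factor for every punctuation mark lying between the two words; the function name cooccurrence_matrix, the punctuation set, and the decay constant punct_decay are assumptions, not the authors' published formulation.

from collections import defaultdict

# Illustrative set of punctuation marks that attenuate co-occurrence weights.
PUNCT = {",", ".", ";", ":", "!", "?", "，", "。", "；", "："}

def cooccurrence_matrix(tokens, window=10, punct_decay=0.5):
    """Build a word-pair co-occurrence dictionary where the GloVe-style
    1/d weight is multiplied by punct_decay for every punctuation mark
    between the center and context word (an assumed attenuation form)."""
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        if center in PUNCT:
            continue
        punct_between = 0
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            context = tokens[j]
            if context in PUNCT:
                punct_between += 1
                continue
            weight = (1.0 / d) * (punct_decay ** punct_between)
            counts[(center, context)] += weight
            counts[(context, center)] += weight
    return counts

# Toy usage: pairs separated by the comma receive attenuated weights.
toks = "we train word vectors , using a small parallel corpus .".split()
for pair, w in sorted(cooccurrence_matrix(toks, window=4).items())[:5]:
    print(pair, round(w, 3))

In this reading, punctuation acts as a soft segment boundary: word pairs that straddle more punctuation contribute less to the global co-occurrence matrix, which matches the intuition the abstract describes.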

License

Unknown   
