Journal Article Details
Journal of Big Data
Pre-trained transformer-based language models for Sundanese
Derwin Suhartono [1], Wilson Wongso [1], Henry Lucky [1]
[1] Computer Science Department, School of Computer Science, Bina Nusantara University
Keywords: Sundanese Language; Transformers; Natural Language Understanding; Low-resource Language
DOI: 10.1186/s40537-022-00590-7
Source: DOAJ
【 Abstract 】

The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefit from recent advances in natural language understanding. As with other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, most of our monolingual models outperformed larger multilingual models despite having less overall pre-training data. Subsequent analyses showed that our models benefited strongly from the size of the Sundanese pre-training corpus and did not exhibit socially biased behavior. We have released our models for other researchers and practitioners to use.
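The downstream usage the abstract describes (loading a released pre-trained checkpoint and applying it to text classification) can be illustrated with a brief Hugging Face Transformers sketch. The checkpoint name, label count, and example sentence below are placeholders not specified in this record; substitute the authors' actual released Sundanese checkpoints.

```python
# Minimal sketch: classify a Sundanese sentence with a pre-trained
# Transformer checkpoint via Hugging Face Transformers.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint name -- replace with the authors' released model.
MODEL_NAME = "username/sundanese-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,  # assumed binary labels, e.g. positive/negative sentiment
)

# Tokenize an example Sundanese sentence and run a forward pass.
inputs = tokenizer("Pilem ieu alus pisan!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The index of the largest logit is the predicted class.
predicted_label = logits.argmax(dim=-1).item()
print(predicted_label)
```

In practice the classification head would first be fine-tuned on a labeled Sundanese dataset (for instance with the `Trainer` API) before its predictions are meaningful; the sketch above shows only the inference path.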

【 License 】

Unknown   
