Journal of Big Data
Pre-trained transformer-based language models for Sundanese
Derwin Suhartono [1], Wilson Wongso [1], Henry Lucky [1]
[1] Computer Science Department, School of Computer Science, Bina Nusantara University
Keywords: Sundanese Language; Transformers; Natural Language Understanding; Low-resource Language
DOI: 10.1186/s40537-022-00590-7
Source: DOAJ
Abstract
The Sundanese language has over 32 million speakers worldwide, yet it has reaped little benefit from recent advances in natural language understanding. As with other low-resource languages, the usual alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In subsequent analyses, our models benefited strongly from the size of the Sundanese pre-training corpus and did not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.
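Since the abstract notes that the models were released for reuse, the sketch below shows how a released monolingual checkpoint would typically be loaded and prepared for a downstream text classification task with the Hugging Face `transformers` library. The abstract does not state the published model identifiers, so `MODEL_NAME` here is a hypothetical placeholder, not the authors' actual checkpoint name, and the two-label setup is an assumed example task.

```python
# Minimal sketch: load a pre-trained monolingual checkpoint and attach a
# sequence classification head for fine-tuning on a downstream task.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "username/sundanese-roberta-base"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,  # assumed binary classification, e.g. sentiment
)

# Tokenize a Sundanese sentence and run a forward pass.
inputs = tokenizer("Wilujeng énjing!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)  # raw class scores before any fine-tuning
```

From here, the model would be fine-tuned on labeled Sundanese data in the standard way (e.g., with the `Trainer` API); the pre-trained encoder weights are reused and only the classification head starts from scratch.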
License
Unknown