Abstract
Nepali is a low-resource language with more than 40 million speakers worldwide. It is written in Devanagari script and has rich semantics and a complex grammatical structure. To date, multilingual models such as Multilingual BERT, XLM, and XLM-RoBERTa have not achieved promising results on Nepali NLP tasks, and no large-scale monolingual corpus exists. This study presents NepBERTa, a BERT-based Natural Language Understanding (NLU) model trained on the most extensive monolingual Nepali corpus to date. We collected a dataset of 0.8B words from 36 popular news sites in Nepal and trained the model on it. This dataset is three times larger than the previous publicly available corpus. We evaluated the performance of NepBERTa on multiple Nepali-specific NLP tasks, including Named-Entity Recognition, Content Classification, POS Tagging, and Sequence Pair Similarity. We also introduce two datasets for two new downstream tasks and benchmark four diverse NLU tasks in total. We bring these four tasks together under the first-ever Nepali Language Understanding Evaluation (Nep-gLUE) benchmark. We will make Nep-gLUE, along with the pre-trained model and datasets, publicly available for research.
Original language | English |
---|---|
Title of host publication | Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing |
Subtitle of host publication | (Volume 2: Short Papers) |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 273-284 |
Number of pages | 12 |
Volume | 2 |
ISBN (Electronic) | 978-1-955917-64-3 |
Publication status | Published - 20 Nov 2022 |
Externally published | Yes |
Event | AACL-IJCNLP 2022 (online event), 20 Nov 2022 → 23 Nov 2022, https://aaclweb.org/ |
Conference
Conference | AACL-IJCNLP 2022 |
---|---|
Abbreviated title | AACL-IJCNLP 2022 |
Period | 20/11/22 → 23/11/22 |
Internet address | https://aaclweb.org/ |