NepBERTa: Nepali Language Model Trained in a Large Corpus

Sulav Timilsina, Milan Gautam, Binod Bhattarai

Research output: Chapter in Book/Report/Conference proceeding › Published conference contribution

Abstract

Nepali is a low-resource language with more than 40 million speakers worldwide. It is written in the Devanagari script and has rich semantics and a complex grammatical structure. To date, multilingual models such as Multilingual BERT, XLM, and XLM-RoBERTa have not achieved promising results on Nepali NLP tasks, and no large-scale monolingual Nepali corpus has been publicly available. This study presents NepBERTa, a BERT-based Natural Language Understanding (NLU) model trained on the most extensive monolingual Nepali corpus to date. We collected a dataset of 0.8 billion words from 36 popular news sites in Nepal and used it to pre-train the model. This dataset is three times larger than the previous publicly available corpus. We evaluated the performance of NepBERTa on multiple Nepali-specific NLP tasks, including Named-Entity Recognition, Content Classification, POS Tagging, and Sequence Pair Similarity. We also introduce two new datasets for two new downstream tasks and benchmark four diverse NLU tasks altogether. We bring all four tasks together under the first-ever Nepali Language Understanding Evaluation (Nep-gLUE) benchmark. We will make Nep-gLUE, along with the pre-trained model and datasets, publicly available for research.
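As a minimal illustration of how a BERT-based NLU model such as NepBERTa can be queried for masked-token prediction, the sketch below uses the Hugging Face transformers library. The model identifier "NepBERTa/NepBERTa" is an assumption for illustration; substitute the identifier from the authors' actual public release.

```python
# A minimal sketch of masked-token prediction with a BERT-based Nepali
# model via the Hugging Face `transformers` library. The identifier
# "NepBERTa/NepBERTa" is assumed here, not confirmed by this record;
# replace it with the name from the authors' release.
from transformers import pipeline

# Build a fill-mask pipeline: the model predicts the token hidden
# behind the [MASK] placeholder in a Devanagari-script sentence.
fill_mask = pipeline("fill-mask", model="NepBERTa/NepBERTa")

# "Nepal is a beautiful [MASK]." -- a well-trained model should rank
# plausible completions such as "देश" (country) highly.
for prediction in fill_mask("नेपाल एक सुन्दर [MASK] हो ।"):
    print(prediction["token_str"], round(prediction["score"], 4))
```

The same pre-trained checkpoint can then be fine-tuned on the Nep-gLUE downstream tasks (NER, content classification, POS tagging, sequence pair similarity) with standard token- or sequence-classification heads.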
Original language: English
Title of host publication: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing
Subtitle of host publication: (Volume 2: Short Papers)
Publisher: Association for Computational Linguistics (ACL)
Pages: 273-284
Number of pages: 12
Volume: 2
ISBN (Electronic): 978-1-955917-64-3
Publication status: Published - 20 Nov 2022
Externally published: Yes
Event: AACL-IJCNLP 2022 - Online Event
Duration: 20 Nov 2022 - 23 Nov 2022
https://aaclweb.org/

Conference

Conference: AACL-IJCNLP 2022
Abbreviated title: AACL-IJCNLP 2022
Period: 20/11/22 - 23/11/22
Internet address: https://aaclweb.org/

Bibliographical note

We would like to thank Google's TPU Research Cloud program for providing us with free and unlimited usage of TPU v3-128 for 90 days. This work would not have been possible without the continuous support and responsiveness of the TRC team.

