NepBERTa: Nepali Language Model Trained in a Large Corpus

Sulav Timilsina, Milan Gautam, Binod Bhattarai

Research output: Chapter in Book/Report/Conference proceeding › Published conference contribution

Abstract

Nepali is a low-resource language with more than 40 million speakers worldwide. It is written in the Devanagari script and has rich semantics and a complex grammatical structure. To date, multilingual models such as Multilingual BERT, XLM, and XLM-RoBERTa have not achieved promising results on Nepali NLP tasks, and no large-scale monolingual Nepali corpus has been publicly available. This study presents NepBERTa, a BERT-based Natural Language Understanding (NLU) model trained on the most extensive monolingual Nepali corpus to date. We collected a dataset of 0.8 billion words from 36 popular news sites in Nepal and used it to pre-train the model. This dataset is three times larger than the previous publicly available corpus. We evaluated the performance of NepBERTa on multiple Nepali-specific NLP tasks, including Named-Entity Recognition, Content Classification, POS Tagging, and Sequence Pair Similarity. We also introduce two new datasets for two new downstream tasks and benchmark four diverse NLU tasks altogether. We bring all four tasks together under the first-ever Nepali Language Understanding Evaluation (Nep-gLUE) benchmark. We will make Nep-gLUE, along with the pre-trained model and datasets, publicly available for research.
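As a minimal illustration of how a BERT-based NLU model such as NepBERTa can be queried for masked-token prediction, the sketch below uses the Hugging Face transformers library. The model identifier "NepBERTa/NepBERTa" is an assumption for illustration; substitute the identifier from the authors' actual public release.

```python
# A minimal sketch of masked-token prediction with a BERT-based Nepali
# model via the Hugging Face `transformers` library. The identifier
# "NepBERTa/NepBERTa" is assumed here, not confirmed by this record;
# replace it with the name from the authors' release.
from transformers import pipeline

# Build a fill-mask pipeline: the model predicts the token hidden
# behind the [MASK] placeholder in a Devanagari-script sentence.
fill_mask = pipeline("fill-mask", model="NepBERTa/NepBERTa")

# "Nepal is a beautiful [MASK]." -- a well-trained model should rank
# plausible completions such as "देश" (country) highly.
for prediction in fill_mask("नेपाल एक सुन्दर [MASK] हो ।"):
    print(prediction["token_str"], round(prediction["score"], 4))
```

The same pre-trained checkpoint can then be fine-tuned on the Nep-gLUE downstream tasks (NER, content classification, POS tagging, sequence pair similarity) with standard token- or sequence-classification heads.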
Original language: English
Title of host publication: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing
Subtitle of host publication: (Volume 2: Short Papers)
Publisher: Association for Computational Linguistics (ACL)
Pages: 273-284
Number of pages: 12
Volume: 2
ISBN (Electronic): 978-1-955917-64-3
Publication status: Published - 20 Nov 2022
Externally published: Yes
Event: AACL-IJCNLP 2022 - Online Event
Duration: 20 Nov 2022 - 23 Nov 2022
https://aaclweb.org/

Conference

Conference: AACL-IJCNLP 2022
Abbreviated title: AACL-IJCNLP 2022
Period: 20/11/22 - 23/11/22
Internet address: https://aaclweb.org/

Bibliographical note

We would like to thank Google's TPU Research Cloud program for providing us with free and unlimited usage of TPU v3-128 for 90 days. This work would not have been possible without the continuous support and responsiveness of the TRC team.

