OdNER: NER resource creation and system development for low-resource Odia language

Tusarkanta Dalai, Anupam Das, Tapas Kumar Mishra, Pankaj Kumar Sa
{"title":"OdNER: NER resource creation and system development for low-resource Odia language","authors":"Tusarkanta Dalai ,&nbsp;Anupam Das ,&nbsp;Tapas Kumar Mishra ,&nbsp;Pankaj Kumar Sa","doi":"10.1016/j.nlp.2025.100139","DOIUrl":null,"url":null,"abstract":"<div><div>This work aims to enhance the usability of natural language processing (NLP) based systems for the low-resource Odia language by focusing on the development of effective named entity recognition (NER) system. NLP applications rely heavily on NER to extract relevant information from massive amounts of unstructured text. The task of identifying and classifying the named entities included in a given text into a set of predetermined categories is referred to as NER. Already, the NER task has accomplished productive results in English as well as in a number of other European languages. On the other hand, because of a lack of supporting tools and resources, it has not yet been thoroughly investigated in Indian languages, particularly the Odia language. Recently, approaches based on machine learning (ML) and deep learning (DL) have demonstrated exceptional performance when it comes to constructing NLP tasks. Moreover, transformer models, particularly masked-language models (MLM), have demonstrated remarkable efficacy in the NER task; nevertheless, these methods generally call for massive volumes of annotated corpus. Unfortunately, we could not find any open-source NER corpus for the Odia language. The purpose of this research is to compile OdNER, a NER dataset with quality baselines for the low-resource Odia language. The Odia NER corpus OdNER contains 48,000 sentences having 6,71,354 tokens and 98,116 name entities annotated with 12 tags. To establish the quality of our corpus, we use conditional random field (CRF) and BiLSTM model as our baseline models. To demonstrate the efficacy of our dataset, we conduct a comparative evaluation of various transformer-based multilingual language models (IndicBERT, MuRIL, XLM-R) and utilize them to carry out the sequence labeling task for NER. With the pre-trained XLM-R multilingual model, our dataset achieves a maximum F1 score of 90.48%. When it comes to Odia NER, no other work comes close to matching the quality and quantity of ours. We anticipate that, this work will have made substantial progress toward the development of NLP tasks for the Odia language.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"11 ","pages":"Article 100139"},"PeriodicalIF":0.0000,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719125000159","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

This work aims to enhance the usability of natural language processing (NLP) based systems for the low-resource Odia language by focusing on the development of an effective named entity recognition (NER) system. NLP applications rely heavily on NER to extract relevant information from massive amounts of unstructured text. NER is the task of identifying the named entities in a given text and classifying them into a set of predetermined categories. The NER task has already achieved productive results in English as well as in a number of other European languages. However, because of a lack of supporting tools and resources, it has not yet been thoroughly investigated for Indian languages, particularly Odia. Recently, approaches based on machine learning (ML) and deep learning (DL) have demonstrated exceptional performance on NLP tasks. Moreover, transformer models, particularly masked language models (MLM), have demonstrated remarkable efficacy on the NER task; however, these methods generally require large volumes of annotated corpora. Unfortunately, we could not find any open-source NER corpus for the Odia language. The purpose of this research is to compile OdNER, a NER dataset with quality baselines for the low-resource Odia language. The OdNER corpus contains 48,000 sentences comprising 671,354 tokens and 98,116 named entities annotated with 12 tags. To establish the quality of our corpus, we use a conditional random field (CRF) and a BiLSTM model as baselines. To demonstrate the efficacy of our dataset, we conduct a comparative evaluation of several transformer-based multilingual language models (IndicBERT, MuRIL, XLM-R), using them to carry out the sequence labeling task for NER. With the pre-trained multilingual XLM-R model, our dataset achieves a maximum F1 score of 90.48%. No other work on Odia NER comes close to matching the quality and quantity of ours. We anticipate that this work will make substantial progress toward the development of NLP tasks for the Odia language.
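The abstract describes fine-tuning multilingual transformers such as XLM-R for NER framed as a sequence labeling (token classification) task. The sketch below is not the authors' code; it only illustrates, under stated assumptions, how the public `xlm-roberta-base` checkpoint can be set up for token classification and how word-level NER tags are aligned to subword tokens. The label set and the toy Odia sentence are hypothetical placeholders; OdNER defines its own 12-tag inventory, which is not reproduced on this page.

```python
# Minimal sketch of XLM-R token-classification setup for NER (assumption-based,
# not the OdNER authors' implementation).
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder tag inventory; OdNER uses 12 tags defined by its annotation scheme.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

def align_labels(words, word_tags):
    """Tokenize a pre-split sentence and align word-level tags to subwords.

    Special tokens and subword pieces after the first piece of a word are
    labeled -100 so the cross-entropy loss ignores them.
    """
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned, prev = [], None
    for wid in enc.word_ids():
        if wid is None or wid == prev:
            aligned.append(-100)
        else:
            aligned.append(label2id[word_tags[wid]])
        prev = wid
    enc["labels"] = aligned
    return enc

# Toy example with made-up tags; actual training would apply this encoding to
# the OdNER sentences and feed the features to a Trainer or a custom loop.
example = align_labels(["ଭୁବନେଶ୍ୱର", "ଓଡ଼ିଶାର", "ରାଜଧାନୀ"], ["B-LOC", "B-LOC", "O"])
print(example["labels"])
```

The same encoding step applies regardless of which multilingual checkpoint (IndicBERT, MuRIL, or XLM-R) is compared, since all three are fine-tuned with a token-classification head on top of the pre-trained encoder.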