Empirical Study on Efficiency of Different Language Modeling Techniques using Masking of Named Entities for Indic Languages

Sravan Kumar Reddy, Shailashree K Sheshadri, Krishna Likith Avatapalli, Deepa Gupta
{"title":"Empirical Study on Efficiency of Different Language Modeling Techniques using Masking of Named Entities for Indic Languages","authors":"Sravan Kumar Reddy,&nbsp;Shailashree K Sheshadri,&nbsp;Krishna Likith Avatapalli,&nbsp;Deepa Gupta","doi":"10.1016/j.procs.2025.04.228","DOIUrl":null,"url":null,"abstract":"<div><div>Processing unstructured text in Natural Language Processing (NLP) poses significant challenges for Indic languages, which feature flexible word order, spelling variations, and complex sentence structures. Traditional models often struggle with these complexities, leading to issues such as out-of-vocabulary (OOV) words and increased perplexity. Neural Language Models (NLMs), particularly transformer-based models, address some of these challenges by employing word representations and self-attention mechanisms. However, OOV problems persist, especially with named entities, which are dynamic and vary across domains, making it difficult to create comprehensive lists of names for people, organizations, and locations. To address this, the Masked Entity-Based Language Model (ME-LM) has been introduced, focusing on masking named entities identified through Named Entity Recognition (NER) using pre-trained models like BERT-base-NER and IndicNER. Applied to Indic languages such as Hindi, Kannada, and Telugu for the first time, ME-LM has significantly reduced OOV occurrences by 18.60% to 94.70% and lowered perplexity. Since this is the first application of ME-LM to these languages, no standard benchmark exists for direct comparison, but the results show strong potential for improving named entity handling in these languages.</div></div>","PeriodicalId":20465,"journal":{"name":"Procedia Computer Science","volume":"258 ","pages":"Pages 146-159"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Procedia Computer Science","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1877050925013304","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Processing unstructured text in Natural Language Processing (NLP) poses significant challenges for Indic languages, which feature flexible word order, spelling variations, and complex sentence structures. Traditional models often struggle with these complexities, leading to issues such as out-of-vocabulary (OOV) words and increased perplexity. Neural Language Models (NLMs), particularly transformer-based models, address some of these challenges by employing word representations and self-attention mechanisms. However, OOV problems persist, especially with named entities, which are dynamic and vary across domains, making it difficult to create comprehensive lists of names for people, organizations, and locations. To address this, the Masked Entity-Based Language Model (ME-LM) has been introduced, focusing on masking named entities identified through Named Entity Recognition (NER) using pre-trained models like BERT-base-NER and IndicNER. Applied to Indic languages such as Hindi, Kannada, and Telugu for the first time, ME-LM has significantly reduced OOV occurrences by 18.60% to 94.70% and lowered perplexity. Since this is the first application of ME-LM to these languages, no standard benchmark exists for direct comparison, but the results show strong potential for improving named entity handling in these languages.
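To make the masking step concrete, the following is a minimal sketch of named-entity masking with a pre-trained NER model, assuming the Hugging Face transformers library and the publicly released ai4bharat/IndicNER checkpoint; the mask_entities helper and the [MASK] placeholder are illustrative choices, not the authors' implementation.

    # Minimal sketch of ME-LM-style entity masking (illustrative, not the
    # authors' code), assuming the Hugging Face `transformers` library and
    # the public ai4bharat/IndicNER checkpoint.
    from transformers import pipeline

    # Token-classification pipeline; aggregation_strategy="simple" merges
    # sub-word pieces back into whole entity spans with character offsets.
    ner = pipeline(
        "token-classification",
        model="ai4bharat/IndicNER",
        aggregation_strategy="simple",
    )

    def mask_entities(text: str, mask_token: str = "[MASK]") -> str:
        """Replace every recognized named-entity span with a mask token."""
        # Replace from the end of the string so earlier offsets stay valid.
        for span in sorted(ner(text), key=lambda s: s["start"], reverse=True):
            text = text[:span["start"]] + mask_token + text[span["end"]:]
        return text

    # Person and location names become [MASK] before the masked text is
    # used to train or evaluate the downstream language model.
    print(mask_entities("Rahul travelled from Hyderabad to Bengaluru."))

Collapsing the open-ended space of person, organization, and location names into a single mask token is plausibly what drives the reported reduction in OOV occurrences: the language model's vocabulary no longer needs to cover every surface name.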
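Perplexity, the second metric reported above, has a standard definition: the exponential of the negative mean log-probability the model assigns to each token. A small self-contained illustration (not the paper's evaluation code):

    import math

    def perplexity(token_log_probs: list[float]) -> float:
        """Exponential of the negative mean per-token log-probability."""
        return math.exp(-sum(token_log_probs) / len(token_log_probs))

    # Sanity check: a model assigning probability 0.25 to each of four
    # tokens has perplexity 4, i.e. it is as uncertain as a uniform
    # choice among four options.
    print(perplexity([math.log(0.25)] * 4))  # -> 4.0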