{"title":"Empirical Study on Efficiency of Different Language Modeling Techniques using Masking of Named Entities for Indic Languages","authors":"Sravan Kumar Reddy, Shailashree K Sheshadri, Krishna Likith Avatapalli, Deepa Gupta","doi":"10.1016/j.procs.2025.04.228","DOIUrl":null,"url":null,"abstract":"<div><div>Processing unstructured text in Natural Language Processing (NLP) poses significant challenges for Indic languages, which feature flexible word order, spelling variations, and complex sentence structures. Traditional models often struggle with these complexities, leading to issues such as out-of-vocabulary (OOV) words and increased perplexity. Neural Language Models (NLMs), particularly transformer-based models, address some of these challenges by employing word representations and self-attention mechanisms. However, OOV problems persist, especially with named entities, which are dynamic and vary across domains, making it difficult to create comprehensive lists of names for people, organizations, and locations. To address this, the Masked Entity-Based Language Model (ME-LM) has been introduced, focusing on masking named entities identified through Named Entity Recognition (NER) using pre-trained models like BERT-base-NER and IndicNER. Applied to Indic languages such as Hindi, Kannada, and Telugu for the first time, ME-LM has significantly reduced OOV occurrences by 18.60% to 94.70% and lowered perplexity. Since this is the first application of ME-LM to these languages, no standard benchmark exists for direct comparison, but the results show strong potential for improving named entity handling in these languages.</div></div>","PeriodicalId":20465,"journal":{"name":"Procedia Computer Science","volume":"258 ","pages":"Pages 146-159"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Procedia Computer Science","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1877050925013304","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Processing unstructured text poses significant challenges in Natural Language Processing (NLP) for Indic languages, which feature flexible word order, spelling variations, and complex sentence structures. Traditional models often struggle with these complexities, leading to out-of-vocabulary (OOV) words and increased perplexity. Neural Language Models (NLMs), particularly transformer-based models, address some of these challenges through learned word representations and self-attention mechanisms. OOV problems persist, however, especially for named entities, which are dynamic and vary across domains, making it difficult to maintain comprehensive lists of names for people, organizations, and locations. To address this, the Masked Entity-Based Language Model (ME-LM) is introduced: named entities identified through Named Entity Recognition (NER), using pre-trained models such as BERT-base-NER and IndicNER, are masked before language modeling. Applied for the first time to the Indic languages Hindi, Kannada, and Telugu, ME-LM reduces OOV occurrences by 18.60% to 94.70% and lowers perplexity. Because this is the first application of ME-LM to these languages, no standard benchmark exists for direct comparison, but the results show strong potential for improving named-entity handling in these languages.
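The entity-masking step the abstract describes can be illustrated with a minimal sketch. This is a hypothetical example, not the paper's implementation: it assumes the Hugging Face transformers NER pipeline and the publicly available ai4bharat/IndicNER checkpoint, and the mask token name [ENT] is an illustrative placeholder; the paper's actual preprocessing, mask token, and model configuration are not given in the abstract.

```python
# Minimal sketch of ME-LM-style entity masking, assuming the Hugging Face
# `transformers` NER pipeline and the public ai4bharat/IndicNER checkpoint.
# The mask token "[ENT]" is an illustrative placeholder, not the paper's choice.
from transformers import pipeline

# aggregation_strategy="simple" merges sub-word pieces into whole entity spans
# with character-level "start"/"end" offsets.
ner = pipeline("ner", model="ai4bharat/IndicNER", aggregation_strategy="simple")

def mask_entities(text: str, mask_token: str = "[ENT]") -> str:
    """Replace each detected named-entity span with a single mask token."""
    entities = ner(text)
    # Replace spans right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + mask_token + text[ent["end"]:]
    return text

if __name__ == "__main__":
    # Hindi example: "Rahul lives in Delhi." Exact output depends on the
    # model's predictions, e.g. "[ENT] [ENT] में रहता है।"
    print(mask_entities("राहुल दिल्ली में रहता है।"))
```

Masked text produced this way would then be used to train or evaluate the language model, so that person, organization, and location names no longer contribute to the OOV vocabulary.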