BanglaLem: A Transformer-based Bangla Lemmatizer with an Enhanced Dataset
Authors: Md Fuadul Islam, Jakir Hasan, Md Ashikul Islam, Prato Dewan, M. Shahidur Rahman
Journal: Systems and Soft Computing, Volume 7, Article 200244
Published: 2025-04-22
DOI: 10.1016/j.sasc.2025.200244
URL: https://www.sciencedirect.com/science/article/pii/S2772941925000626
Lemmatization plays a crucial role in many natural language processing (NLP) tasks, such as information retrieval, sentiment analysis, text summarization, and text classification. However, Bangla lemmatization remains particularly challenging because of the language's rich morphology and high inflectional complexity. Existing open-access datasets for Bangla lemmatization are limited in size: the largest contains only 22,353 unique inflected words, which constrains the effectiveness of data-driven neural models. To address this limitation, we introduce a novel dataset, BanglaLem, comprising 96,040 frequently used inflected words. The dataset has been carefully curated and annotated through a rigorous selection process to enhance the accuracy and efficiency of Bangla lemmatization. Furthermore, we propose a transformer-based approach to lemmatization and evaluate the performance of various pre-trained and trained-from-scratch transformer models on this dataset. Among these, the BanglaT5 model achieved the highest exact match accuracy, 94.42%, on the test set. The BanglaLem dataset is publicly accessible.
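The exact match accuracy reported above counts a prediction as correct only when the predicted lemma is identical to the reference lemma. The following is a minimal sketch of that metric, not the authors' evaluation code: the function names, the NFC normalization and whitespace stripping, and the toy English word lists are all assumptions made for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    # Assumption: compare strings after Unicode NFC normalization and
    # whitespace stripping. The paper's exact preprocessing is not stated here.
    return unicodedata.normalize("NFC", text).strip()


def exact_match_accuracy(predictions, references):
    """Return the fraction of predictions identical to their reference lemma."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty test set")
    matches = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


# Toy data (hypothetical, not taken from BanglaLem): three of four lemmas match.
predicted = ["run", "book", "go", "child"]
gold = ["run", "book", "went", "child"]
accuracy = exact_match_accuracy(predicted, gold)
print(f"{accuracy:.2%}")  # prints 75.00%
```

Because the metric gives no partial credit, a lemma that differs from the reference by a single character counts as a full error.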