A hybrid approach for Bengali sentence validation

IF 10.7 · Zone 2 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)

Artificial Intelligence Review Pub Date : 2024-10-07 DOI:10.1007/s10462-024-10795-2

Juel Sikder, Prosenjit Chakraborty, Utpol Kanti Das, Krity Dhar

Journal article. Artificial Intelligence Review 57 (11). Article: https://link.springer.com/article/10.1007/s10462-024-10795-2 PDF: https://link.springer.com/content/pdf/10.1007/s10462-024-10795-2.pdf

Citations: 0

Abstract

Bengali is the official language of Bangladesh and is widely used there and in the Indian state of West Bengal. As internet access and smart devices spread, the volume of digital Bengali text and documents keeps growing. This study proposes an automated Bengali Sentence Validation System that determines whether sentences in this widely available content are correct. To the best of our knowledge, no substantial work has applied deep learning to Bengali sentence validation. Developing such a system for a low-resource language like Bengali is challenging because linguistic resources, sophisticated Natural Language Processing tools, and benchmark datasets are lacking. Bengali sentences also come in two morphological varieties (Sadhu-bhasha and Cholito-bhasha), which makes validation harder still. The proposed system uses a hybrid CNN-BiLSTM classifier model. Because no standard dataset for Bengali sentence validation exists, we collected Bengali sentences from different sources in Bangladesh and developed the Bengali Sentence Validation (BSV) Dataset, which contains around 5000 labelled sentences in two categories: correct and incorrect. Experimental results show that the proposed system outperforms other classifier models and existing approaches to Bengali sentence validation and can classify a wide range of Bengali sentences by correctness. The system achieves an F1 score of 98% on Bengali sentence validation.
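For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F1 score is computed for a two-class correct/incorrect task. The labels and predictions are hypothetical toy data, not the BSV Dataset, and the function names are illustrative. The abstract does not say which class is treated as positive or whether the 98% figure is averaged across classes, so treating "correct" as the positive class is an assumption.

```python
def confusion_counts(y_true, y_pred, positive="correct"):
    """Count true positives, false positives and false negatives.

    Assumption: "correct" is the positive class; the abstract does not specify this.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn


def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 to avoid division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and model predictions for six sentences.
y_true = ["correct", "correct", "incorrect", "correct", "incorrect", "incorrect"]
y_pred = ["correct", "incorrect", "incorrect", "correct", "correct", "incorrect"]

tp, fp, fn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, round(f1_score(tp, fp, fn), 3))  # prints: 2 1 1 0.667
```

Because F1 combines precision and recall, a score of 98% implies that both false acceptances of incorrect sentences and false rejections of correct ones are rare on the evaluated data.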

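Source journal

Artificial Intelligence Review (Engineering & Technology: Computer Science, Artificial Intelligence)

CiteScore: 22.00

Self-citation rate: 3.30%

Articles published: 194

Review time: 5.3 months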

About the journal: Artificial Intelligence Review, a fully open access journal, publishes cutting-edge research in artificial intelligence and cognitive science. It features critical evaluations of applications, techniques, and algorithms, providing a platform for both researchers and application developers. The journal includes refereed survey and tutorial articles, along with reviews and commentary on significant developments in the field.