An Annotated Dataset and Automatic Approaches for Discourse Mode Identification in Low-resource Bengali Language

Proceedings of the Workshop on Multilingual Information Access (MIA) Pub Date : 1900-01-01 DOI:10.18653/v1/2022.mia-1.2

Salim Sazzed

Citations: 1

Abstract

Discourse modes help clarify the conventions and purposes of the various forms of language used in communication. In this study, we introduce a discourse-mode-annotated corpus for the low-resource Bangla (also referred to as Bengali) language. The corpus consists of sentence-level annotations of three discourse modes (narrative, descriptive, and informative) for text excerpted from a number of Bangla novels. We analyze the annotated corpus to expose various linguistic aspects of discourse modes, such as class distributions and average sentence lengths. To automatically determine the mode of discourse, we apply CML (classical machine learning) classifiers with n-gram-based statistical features and a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) based language model. We observe that the fine-tuned BERT-based approach yields more promising results than the n-gram-based CML classifiers. The discourse-mode-annotated dataset, the first of its kind in Bangla, and the accompanying evaluation provide baselines for automatic discourse mode identification in Bangla and can assist various downstream natural language processing tasks.
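To make the n-gram baseline idea concrete, the following is a minimal sketch of sentence-level discourse mode classification from word n-gram counts. It is not the paper's implementation: the abstract does not say which CML classifiers or n-gram orders were used, so multinomial Naive Bayes with unigrams and bigrams is an assumption chosen for brevity. The names `word_ngrams` and `NgramNaiveBayes` are hypothetical, and the English training sentences are invented toy data, not excerpts from the corpus.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(sentence, n_max=2):
    """Return word n-grams of order 1..n_max from a whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


class NgramNaiveBayes:
    """Multinomial Naive Bayes over n-gram counts with add-one smoothing."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)          # sentences per class
        self.feature_counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            grams = word_ngrams(sentence)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        grams = word_ngrams(sentence)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + vocab_size
            for gram in grams:
                score += math.log((self.feature_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy examples (assumption): one label per sentence, three modes.
train = [
    ("she opened the door and ran outside", "narrative"),
    ("he picked up the letter and left", "narrative"),
    ("the room was small and dimly lit", "descriptive"),
    ("the river was wide and calm", "descriptive"),
    ("the village has two schools", "informative"),
    ("the town has one market", "informative"),
]
model = NgramNaiveBayes().fit([s for s, _ in train], [y for _, y in train])

print(model.predict("the garden was quiet and green"))
print(model.predict("the city has one library"))
```

A real baseline for Bangla text would need a tokenizer suited to the script and a held-out evaluation split; whitespace splitting is used here only to keep the sketch self-contained.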
