Proceedings of the Workshop on Multilingual Information Access (MIA): Latest Articles

An Annotated Dataset and Automatic Approaches for Discourse Mode Identification in Low-resource Bengali Language

Proceedings of the Workshop on Multilingual Information Access (MIA) Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.mia-1.2

Salim Sazzed

Abstract: The modes of discourse aid in comprehending the convention and purpose of various forms of language used during communication. In this study, we introduce a discourse mode annotated corpus for the low-resource Bangla (also referred to as Bengali) language. The corpus consists of sentence-level annotation of three discourse modes (narrative, descriptive, and informative) for text excerpted from a number of Bangla novels. We analyze the annotated corpus to expose various linguistic aspects of discourse modes, such as class distributions and average sentence lengths. To automatically determine the mode of discourse, we apply CML (classical machine learning) classifiers with n-gram based statistical features and a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) based language model. We observe that the fine-tuned BERT-based approach yields more promising results than the n-gram based CML classifiers. Our discourse mode annotated dataset, the first of its kind in Bangla, and the accompanying evaluation provide baselines for automatic discourse mode identification in Bangla and can assist various downstream natural language processing tasks.

Citations: 1

Zero-shot cross-lingual open domain question answering

Proceedings of the Workshop on Multilingual Information Access (MIA) Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.mia-1.9

Sumit Agarwal, Suraj Tripathi, T. Mitamura, C. Rosé

Citations: 1

Complex Word Identification in Vietnamese: Towards Vietnamese Text Simplification

Proceedings of the Workshop on Multilingual Information Access (MIA) Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.mia-1.6

Phuong-Thai Nguyen, David Kauchak

Citations: 0

Benchmarking Language-agnostic Intent Classification for Virtual Assistant Platforms

Proceedings of the Workshop on Multilingual Information Access (MIA) Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.mia-1.7

Gengyu Wang, Cheng Qian, Lin Pan, Haode Qi, L. Kunc, Saloni Potdar

Citations: 2