Proceedings of the Workshop on Multilingual Information Access (MIA): Latest Papers

An Annotated Dataset and Automatic Approaches for Discourse Mode Identification in Low-resource Bengali Language
Proceedings of the Workshop on Multilingual Information Access (MIA) Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.mia-1.2
Salim Sazzed
Abstract: The modes of discourse aid in comprehending the convention and purpose of various forms of languages used during communication. In this study, we introduce a discourse mode annotated corpus for the low-resource Bangla (also referred to as Bengali) language. The corpus consists of sentence-level annotation of three different discourse modes, narrative, descriptive, and informative of the text excerpted from a number of Bangla novels. We analyze the annotated corpus to expose various linguistic aspects of discourse modes, such as class distributions and average sentence lengths. To automatically determine the mode of discourse, we apply CML (classical machine learning) classifiers with n-gram based statistical features and a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) based language model. We observe that fine-tuned BERT-based approach yields more promising results than n-gram based CML classifiers. Our created discourse mode annotated dataset, the first of its kind in Bangla, and the evaluation, provide baselines for the automatic discourse mode identification in Bangla and can assist various downstream natural language processing tasks.
Citations: 1
Zero-shot cross-lingual open domain question answering
Proceedings of the Workshop on Multilingual Information Access (MIA) Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.mia-1.9
Sumit Agarwal, Suraj Tripathi, T. Mitamura, C. Rosé
Abstract: People speaking different kinds of languages search for information in a cross-lingual manner. They tend to ask questions in their language and expect the answer to be in the same language, despite the evidence lying in another language. In this paper, we present our approach for this task of cross-lingual open-domain question-answering. Our proposed method employs a passage reranker, the fusion-in-decoder technique for generation, and a wiki data entity-based post-processing system to tackle the inability to generate entities across all languages. Our end-2-end pipeline shows an improvement of 3 and 4.6 points on F1 and EM metrics respectively, when compared with the baseline CORA model on the XOR-TyDi dataset. We also evaluate the effectiveness of our proposed techniques in the zero-shot setting using the MKQA dataset and show an improvement of 5 points in F1 for high-resource and 3 points improvement for low-resource zero-shot languages. Our team, CMUmQA’s submission in the MIA-Shared task ranked 1st in the constrained setup for the dev and 2nd in the test setting.
Citations: 1
Complex Word Identification in Vietnamese: Towards Vietnamese Text Simplification
Proceedings of the Workshop on Multilingual Information Access (MIA) Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.mia-1.6
Phuong-Thai Nguyen, David Kauchak
Abstract: Text Simplification has been an extensively researched problem in English, but has not been investigated in Vietnamese. We focus on the Vietnamese-specific Complex Word Identification task, often the first step in Lexical Simplification (Shardlow, 2013). We examine three different Vietnamese datasets constructed for other Natural Language Processing tasks and show that, like in other languages, frequency is a strong signal in determining whether a word is complex, with a mean accuracy of 86.87%. Across the datasets, we find that the 10% most frequent words in many corpus can be labelled as simple, and the rest as complex, though this is more variable for smaller corpora. We also examine how human annotators perform at this task. Given the subjective nature, there is a fair amount of variability in which words are seen as difficult, though majority results are more consistent.
Citations: 0
Benchmarking Language-agnostic Intent Classification for Virtual Assistant Platforms
Proceedings of the Workshop on Multilingual Information Access (MIA) Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.mia-1.7
Gengyu Wang, Cheng Qian, Lin Pan, Haode Qi, L. Kunc, Saloni Potdar
Abstract: Current virtual assistant (VA) platforms are beholden to the limited number of languages they support. Every component, such as the tokenizer and intent classifier, is engineered for specific languages in these intricate platforms. Thus, supporting a new language in such platforms is a resource-intensive operation requiring expensive re-training and re-designing. In this paper, we propose a benchmark for evaluating language-agnostic intent classification, the most critical component of VA platforms. To ensure the benchmarking is challenging and comprehensive, we include 29 public and internal datasets across 10 low-resource languages and evaluate various training and testing settings with consideration of both accuracy and training time. The benchmarking result shows that Watson Assistant, among 7 commercial VA platforms and pre-trained multilingual language models (LMs), demonstrates close-to-best accuracy with the best accuracy-training time trade-off.
Citations: 2