ViMRC - VLSP 2021: Using XLM-RoBERTa and Filter Output for Vietnamese Machine Reading Comprehension

Văn Nhân Đặng, Minh Le Nguyen
DOI: 10.25073/2588-1086/vnucsce.336
VNU Journal of Science: Computer Science and Communication Engineering, Vol. 18, No. 6
Published: 2022-12-16

Abstract

Machine Reading Comprehension (MRC) has recently made significant progress. This paper presents the MRC system we built for Vietnamese in the Vietnamese Machine Reading Comprehension shared task at the 8th International Workshop on Vietnamese Language and Speech Processing (VLSP 2021). Based on SQuAD2.0, the organizing committee developed the Vietnamese Question Answering Dataset UIT-ViQuAD2.0, a reading comprehension dataset consisting of questions posed by crowd-workers on a set of Vietnamese Wikipedia articles. UIT-ViQuAD2.0 evolved from version 1.0; the difference is that version 2.0 contains both answerable and unanswerable questions. The challenge of this task is to distinguish between the two: the answer to each question is a span of text from the corresponding reading passage, or the question may be unanswerable. Our system employs simple yet highly effective methods. It uses the pre-trained language model XLM-RoBERTa (XLM-R), combined with filtering across multiple output files to produce the final result. We created about 5-7 output files and selected the answer with the most repetitions as the final prediction. After filtering, our system's score increased from 75.172% to 76.386% in F1 and reached 65.329% in EM on the Private Test set.
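The filtering step described above (selecting, per question, the answer repeated most often across the 5-7 output files) can be sketched as a simple majority vote. This is a minimal illustration, not the authors' released code; the function name `filter_predictions` and the dict-of-predictions format (question ID mapped to answer string, with the empty string marking "unanswerable" as in SQuAD2.0-style output) are assumptions for the sketch:

```python
from collections import Counter

def filter_predictions(prediction_files):
    """Majority-vote over per-question answers from several output files.

    prediction_files: list of dicts mapping question_id -> predicted answer
    (an empty string denotes "unanswerable", SQuAD2.0-style).
    Returns a dict keeping the most frequent answer per question.
    """
    final = {}
    for qid in prediction_files[0]:
        # Count how often each candidate answer appears across the runs.
        votes = Counter(preds[qid] for preds in prediction_files)
        # Keep the most repeated answer as the final prediction.
        final[qid] = votes.most_common(1)[0][0]
    return final

# Example: three runs disagree on "q1" but agree that "q2" is unanswerable.
runs = [
    {"q1": "Hà Nội", "q2": ""},
    {"q1": "Hà Nội", "q2": ""},
    {"q1": "Thành phố Hồ Chí Minh", "q2": "1945"},
]
print(filter_predictions(runs))  # {'q1': 'Hà Nội', 'q2': ''}
```

With 5-7 output files an odd count avoids most ties; how the paper breaks remaining ties is not specified here, so `Counter.most_common` simply falls back to first-seen order.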