ViMRC - VLSP 2021: Using XLM-RoBERTa and Filter Output for Vietnamese Machine Reading Comprehension
Authors: Văn Nhân Đặng, Minh Le Nguyen
Journal: VNU Journal of Science: Computer Science and Communication Engineering
Published: 2022-12-16
DOI: https://doi.org/10.25073/2588-1086/vnucsce.336
Abstract
Machine Reading Comprehension (MRC) has recently made significant progress. This paper reports on our participation in the Vietnamese Machine Reading Comprehension shared task at the 8th International Workshop on Vietnamese Language and Speech Processing (VLSP 2021), for which we built an MRC system specifically for Vietnamese. Based on SQuAD2.0, the organizing committee developed the Vietnamese Question Answering Dataset UIT-ViQuAD2.0, a reading comprehension dataset consisting of questions posed by crowd-workers on a set of Vietnamese Wikipedia articles. UIT-ViQuAD2.0 evolved from version 1.0; the difference is that version 2.0 contains both answerable and unanswerable questions. The answer to each question is either a span of text from the corresponding reading passage, or the question is unanswerable, so the main challenge of the task is to distinguish answerable questions from unanswerable ones. Our system employs simple yet highly effective methods. It uses a pre-trained language model, XLM-RoBERTa (XLM-R), and filters the results from multiple output files to produce the final result. We generated about 5-7 output files and selected, for each question, the answer that was repeated most often across them as the final prediction. After filtering, our system's F1 score increased from 75.172% to 76.386%, and it achieved an EM score of 65.329% on the Private Test set.
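The output-filtering step described above amounts to a majority vote over several prediction files. The following is only a minimal sketch of that idea, not the authors' implementation. It assumes each output file has already been loaded as a dictionary mapping a question id to a predicted answer string, with an empty string standing for "unanswerable" (a common SQuAD2.0-style convention, assumed here). The function name `vote_answers`, the sample data, and the tie-breaking rule (the answer seen first in run order wins) are all hypothetical; the abstract does not say how ties were resolved.

```python
from collections import Counter


def vote_answers(runs):
    """Return, for each question id, the answer predicted by the most runs.

    `runs` is a list of dicts mapping question id -> predicted answer string.
    An empty string is assumed to mean "unanswerable".
    Ties are broken in favour of the answer encountered first in run order
    (an assumption, not something stated in the abstract).
    """
    final = {}
    for qid in runs[0]:
        # Count how often each distinct answer string was predicted.
        votes = Counter(run[qid] for run in runs if qid in run)
        # most_common keeps first-seen order among equal counts.
        final[qid] = votes.most_common(1)[0][0]
    return final


# Hypothetical predictions from three runs, for illustration only.
runs = [
    {"q1": "Hà Nội", "q2": ""},
    {"q1": "Hà Nội", "q2": "1945"},
    {"q1": "thủ đô Hà Nội", "q2": ""},
]

print(vote_answers(runs))
```

Voting on exact answer strings is the simplest choice. Answers that differ only in whitespace or span boundaries are counted as different votes, so a real system might normalise the strings first.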