EVJVQA CHALLENGE: MULTILINGUAL VISUAL QUESTION ANSWERING

Ngan Luu-Thuy Nguyen, Nghia Hieu Nguyen, Duong T. D. Vo, Khanh Quoc Tran, Kiet Van Nguyen
{"title":"EVJVQA CHALLENGE: MULTILINGUAL VISUAL QUESTION ANSWERING","authors":"Nguyen, Ngan Luu-Thuy, Nguyen, Nghia Hieu, Vo, Duong T. D, Tran, Khanh Quoc, Van Nguyen, Kiet","doi":"10.15625/1813-9663/18157","DOIUrl":null,"url":null,"abstract":"Visual Question Answering (VQA) is a challenging task of natural language processing (NLP) and computer vision (CV), attracting significant attention from researchers. English is a resource-rich language that has witnessed various developments in datasets and models for visual question answering. Visual question answering in other languages also would be developed for resources and models. In addition, there is no multilingual dataset targeting the visual content of a particular country with its own objects and cultural characteristics. To address the weakness, we provide the research community with a benchmark dataset named EVJVQA, including 33,000+ pairs of question-answer over three languages: Vietnamese, English, and Japanese, on approximately 5,000 images taken from Vietnam for evaluating multilingual VQA systems or models. EVJVQA is used as a benchmark dataset for the challenge of multilingual visual question answering at the 9th Workshop on Vietnamese Language and Speech Processing (VLSP 2022). This task attracted 62 participant teams from various universities and organizations. In this article, we present details of the organization of the challenge, an overview of the methods employed by shared-task participants, and the results. The highest performances are 0.4392 in F1-score and 0.4009 in BLUE on the private test set. The multilingual QA systems proposed by the top 2 teams use ViT for the pre-trained vision model and mT5 for the pre-trained language model, a powerful pre-trained language model based on the transformer architecture. EVJVQA is a challenging dataset that motivates NLP and CV researchers to further explore the multilingual models or systems for visual question answering systems.","PeriodicalId":15444,"journal":{"name":"Journal of Computer Science and Cybernetics","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computer Science and Cybernetics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.15625/1813-9663/18157","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Visual Question Answering (VQA) is a challenging task at the intersection of natural language processing (NLP) and computer vision (CV) that attracts significant attention from researchers. English is a resource-rich language and has seen extensive development of VQA datasets and models, whereas visual question answering in other languages still lacks comparable resources and models. In addition, no existing multilingual dataset targets the visual content of a particular country, with its own objects and cultural characteristics. To address this gap, we provide the research community with a benchmark dataset named EVJVQA, containing 33,000+ question-answer pairs in three languages (Vietnamese, English, and Japanese) over approximately 5,000 images taken in Vietnam, for evaluating multilingual VQA systems and models. EVJVQA served as the benchmark dataset for the multilingual visual question answering challenge at the 9th Workshop on Vietnamese Language and Speech Processing (VLSP 2022), which attracted 62 participating teams from various universities and organizations. In this article, we present the organization of the challenge, an overview of the methods employed by the shared-task participants, and the results. The highest performances on the private test set are 0.4392 in F1-score and 0.4009 in BLEU. The multilingual VQA systems proposed by the top two teams use ViT as the pre-trained vision encoder and mT5, a powerful Transformer-based pre-trained language model, as the text backbone. EVJVQA is a challenging dataset that motivates NLP and CV researchers to further explore multilingual models and systems for visual question answering.
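The abstract notes that the top two systems pair a ViT image encoder with an mT5 encoder-decoder. The sketch below illustrates one common way such a pairing can be wired: ViT patch features are projected into the mT5 embedding space and prepended to the question embeddings. The checkpoint names, the linear projection, and the concatenation-based fusion are illustrative assumptions, not the participants' exact architectures.

```python
# A minimal sketch, under assumptions, of a ViT + mT5 multilingual VQA model.
# The fusion strategy (project patch features, concatenate with question
# embeddings) is one plausible design, not the shared-task teams' exact method.
import torch
from torch import nn
from transformers import MT5ForConditionalGeneration, ViTModel


class ViTmT5VQA(nn.Module):
    def __init__(self,
                 vit_name: str = "google/vit-base-patch16-224-in21k",
                 mt5_name: str = "google/mt5-base"):
        super().__init__()
        self.vit = ViTModel.from_pretrained(vit_name)
        self.mt5 = MT5ForConditionalGeneration.from_pretrained(mt5_name)
        # Project ViT patch features into the mT5 embedding space.
        self.proj = nn.Linear(self.vit.config.hidden_size, self.mt5.config.d_model)

    def forward(self, pixel_values, input_ids, attention_mask, labels=None):
        # 1. Encode the image into a sequence of patch embeddings.
        vis = self.proj(self.vit(pixel_values=pixel_values).last_hidden_state)
        # 2. Embed the (multilingual) question tokens with the mT5 embedding table.
        txt = self.mt5.get_input_embeddings()(input_ids)
        # 3. Concatenate visual and textual sequences and extend the attention mask.
        inputs_embeds = torch.cat([vis, txt], dim=1)
        vis_mask = torch.ones(vis.shape[:2],
                              dtype=attention_mask.dtype,
                              device=attention_mask.device)
        mask = torch.cat([vis_mask, attention_mask], dim=1)
        # 4. Let the mT5 decoder generate the answer conditioned on both modalities.
        return self.mt5(inputs_embeds=inputs_embeds, attention_mask=mask, labels=labels)
```

With recent versions of the transformers library, answers can then be generated by calling `self.mt5.generate(inputs_embeds=..., attention_mask=...)`; the question text and image would be prepared with the tokenizer and image processor matching the chosen checkpoints.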
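The reported scores are answer-level F1 and BLEU. As a rough reference, the snippet below computes token-overlap F1 and smoothed sentence-level BLEU between a predicted and a gold answer; the whitespace tokenization and the smoothing method are assumptions, and the official VLSP 2022 evaluation script may tokenize differently (especially for Japanese, which has no whitespace word boundaries).

```python
# A minimal sketch of answer-level token-overlap F1 and sentence BLEU.
# Tokenization and smoothing choices are assumptions, not the official script.
from collections import Counter
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def answer_bleu(prediction: str, reference: str) -> float:
    """Sentence-level BLEU of the predicted answer against the gold answer."""
    smoothing = SmoothingFunction().method1  # short answers need smoothing
    return sentence_bleu([reference.lower().split()],
                         prediction.lower().split(),
                         smoothing_function=smoothing)


if __name__ == "__main__":
    print(token_f1("two motorbikes", "two red motorbikes"))   # partial overlap
    print(answer_bleu("two motorbikes", "two red motorbikes"))
```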