{"title":"Indic Visual Question Answering","authors":"A. Chandrasekar, Amey Shimpi, D. Naik","doi":"10.1109/SPCOM55316.2022.9840835","DOIUrl":null,"url":null,"abstract":"Visual Question Answering (VQA) is a problem at the intersection of Computer Vision (CV) and Natural Language Processing (NLP) which involves using natural language to respond to questions based on the context of images. The majority of existing methods focus on monolingual models, particularly those that only support English. This paper proposes a novel dataset alongside monolingual and multilingual models using the baseline and attention-based architectures with support for three Indic languages: Hindi, Kannada, and Tamil. We compare the performance of traditional (CNN + LSTM) approaches with current attention-based methods using the VQA v2 dataset. The proposed work achieves 51.618% accuracy for Hindi, 57.177% for Kannada, and 56.061% for the Tamil model.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPCOM55316.2022.9840835","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Visual Question Answering (VQA) is a problem at the intersection of Computer Vision (CV) and Natural Language Processing (NLP) that involves answering natural-language questions about the content of images. Most existing methods focus on monolingual models, particularly models that support only English. This paper proposes a novel dataset alongside monolingual and multilingual models, built on both baseline and attention-based architectures, with support for three Indic languages: Hindi, Kannada, and Tamil. We compare the performance of a traditional (CNN + LSTM) approach with current attention-based methods on the VQA v2 dataset. The proposed models achieve 51.618% accuracy for Hindi, 57.177% for Kannada, and 56.061% for Tamil.
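To make the "traditional (CNN + LSTM)" baseline concrete, below is a minimal PyTorch sketch of that style of VQA model: a CNN encodes the image, an LSTM encodes the question, the two features are fused element-wise, and a classifier predicts over a fixed answer vocabulary. The class name, layer sizes, and fusion choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CNNLSTMBaseline(nn.Module):
    """Minimal CNN + LSTM VQA baseline: encode the image with a small CNN,
    encode the question with an embedding + LSTM, fuse by element-wise
    product, and classify over a fixed answer vocabulary."""

    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
        super().__init__()
        # Toy image encoder; a real baseline would typically use a pretrained
        # CNN (e.g. VGG or ResNet) as the feature extractor instead.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim), nn.Tanh(),
        )
        # Question encoder: word embeddings followed by an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Answer classifier over the fused image-question representation.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question_ids):
        img_feat = self.cnn(image)                        # (B, hidden_dim)
        _, (h_n, _) = self.lstm(self.embed(question_ids))
        q_feat = h_n[-1]                                  # (B, hidden_dim)
        fused = img_feat * q_feat                         # element-wise fusion
        return self.classifier(fused)                     # answer logits

# Usage with dummy tensors: a batch of 2 images and tokenized questions.
model = CNNLSTMBaseline(vocab_size=10000, num_answers=1000)
images = torch.randn(2, 3, 224, 224)
questions = torch.randint(1, 10000, (2, 12))
logits = model(images, questions)   # shape: (2, 1000)
```

Attention-based variants, as compared in the paper, typically replace the single pooled image vector with a grid of region features and weight them by the question representation before fusion; the sketch above covers only the non-attention baseline.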