{"title":"视觉问答算法的比较研究","authors":"A. Mostafa, Hazem M. Abbas, M. Khalil","doi":"10.1109/ICCES51560.2020.9334686","DOIUrl":null,"url":null,"abstract":"Visual Question Answering (VQA) is a recent task that challenges algorithms to reason about the visual content of an image to be able to answer a natural language question. In this study, we compare the performance of state of the art VQA algorithms on different VQA benchmarks. Each benchmark is more effective at testing VQA algorithms on different levels. Some datasets challenge the algorithms to perform complex reasoning steps to arrive to an answer. Other datasets might challenge algorithms to retrieve external world knowledge to answer the posed questions. We categorize the algorithms by their main contributions into 4 categories. Firstly, the joint embedding approach which focuses on how to map the visual and textual data into a common embedding space. Secondly, attention based methods which focuses on relevant parts of the image or the question. Thirdly, compositional models which deal with composing a model from smaller modules. Finally, we introduce external-knowledge based algorithms which need external sources to be able to retrieve facts necessary to answer a question when those facts may not be present in the scene nor in the whole training data set. We also mention other algorithms that don’t specifically belong to the aforementioned categories, but offers performance competitive with the state of the art.","PeriodicalId":247183,"journal":{"name":"2020 15th International Conference on Computer Engineering and Systems (ICCES)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Comparative Study of Visual Question Answering Algorithms\",\"authors\":\"A. Mostafa, Hazem M. Abbas, M. Khalil\",\"doi\":\"10.1109/ICCES51560.2020.9334686\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visual Question Answering (VQA) is a recent task that challenges algorithms to reason about the visual content of an image to be able to answer a natural language question. In this study, we compare the performance of state of the art VQA algorithms on different VQA benchmarks. Each benchmark is more effective at testing VQA algorithms on different levels. Some datasets challenge the algorithms to perform complex reasoning steps to arrive to an answer. Other datasets might challenge algorithms to retrieve external world knowledge to answer the posed questions. We categorize the algorithms by their main contributions into 4 categories. Firstly, the joint embedding approach which focuses on how to map the visual and textual data into a common embedding space. Secondly, attention based methods which focuses on relevant parts of the image or the question. Thirdly, compositional models which deal with composing a model from smaller modules. Finally, we introduce external-knowledge based algorithms which need external sources to be able to retrieve facts necessary to answer a question when those facts may not be present in the scene nor in the whole training data set. 
We also mention other algorithms that don’t specifically belong to the aforementioned categories, but offers performance competitive with the state of the art.\",\"PeriodicalId\":247183,\"journal\":{\"name\":\"2020 15th International Conference on Computer Engineering and Systems (ICCES)\",\"volume\":\"25 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-12-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 15th International Conference on Computer Engineering and Systems (ICCES)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCES51560.2020.9334686\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 15th International Conference on Computer Engineering and Systems (ICCES)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCES51560.2020.9334686","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Comparative Study of Visual Question Answering Algorithms
A. Mostafa, Hazem M. Abbas, M. Khalil
2020 15th International Conference on Computer Engineering and Systems (ICCES), 2020-12-15. DOI: 10.1109/ICCES51560.2020.9334686
Visual Question Answering (VQA) is a recent task that challenges algorithms to reason about the visual content of an image in order to answer a natural language question. In this study, we compare the performance of state-of-the-art VQA algorithms on different VQA benchmarks. Each benchmark tests VQA algorithms at a different level: some datasets challenge the algorithms to perform complex reasoning steps to arrive at an answer, while others require algorithms to retrieve external world knowledge to answer the posed questions. We categorize the algorithms by their main contributions into four categories. First, joint embedding approaches focus on mapping the visual and textual data into a common embedding space. Second, attention-based methods focus on the relevant parts of the image or the question. Third, compositional models compose a full model from smaller modules. Finally, external-knowledge-based algorithms rely on external sources to retrieve the facts necessary to answer a question when those facts are present neither in the scene nor anywhere in the training data set. We also discuss other algorithms that do not fit neatly into these categories but offer performance competitive with the state of the art.
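As a concrete illustration of the first category, the sketch below shows a minimal joint-embedding VQA model in PyTorch: pooled image features and an encoded question are projected into a common embedding space, fused element-wise, and classified over a fixed answer vocabulary. All module choices, dimensions, and names here are illustrative assumptions for exposition, not the architecture of any specific algorithm surveyed in the paper.

```python
import torch
import torch.nn as nn

class JointEmbeddingVQA(nn.Module):
    """Minimal joint-embedding VQA sketch (illustrative, not from the paper):
    project image and question features into a shared space, fuse them,
    and classify over a fixed answer vocabulary."""

    def __init__(self, img_dim=2048, q_dim=300, hidden=1024, num_answers=3000):
        super().__init__()
        # Question encoder: summarize pre-embedded word vectors with an LSTM.
        self.q_encoder = nn.LSTM(q_dim, hidden, batch_first=True)
        # Linear projection maps image features into the common space.
        self.img_proj = nn.Linear(img_dim, hidden)
        # Treat answering as classification over candidate answers.
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, img_feat, q_embeds):
        # img_feat: (B, img_dim) pooled CNN features (e.g., from a ResNet).
        # q_embeds: (B, T, q_dim) pre-embedded question tokens.
        _, (h, _) = self.q_encoder(q_embeds)
        q = h[-1]                                 # (B, hidden) question summary
        v = torch.relu(self.img_proj(img_feat))   # (B, hidden) projected image
        fused = v * q                              # element-wise fusion in the joint space
        return self.classifier(fused)              # logits over candidate answers

# Usage with random tensors, just to show the expected shapes:
model = JointEmbeddingVQA()
logits = model(torch.randn(8, 2048), torch.randn(8, 14, 300))  # (8, 3000)
```

Attention-based methods would replace the single pooled image vector with per-region features weighted by their relevance to the question, while compositional and external-knowledge-based approaches restructure the pipeline more substantially.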