{"title":"基于胶囊网络的物体位置关系视觉问答","authors":"H. Yanagimoto, Riki Nakatani, Kiyota Hashimoto","doi":"10.1109/IIAIAAI55812.2022.00027","DOIUrl":null,"url":null,"abstract":"This paper presents a visual question answering (VQA) system focusing on object positional relation, which consists of Capsule Network and a recurrent neural language model. Grasping object positions in an image is necessary to understand the image and appropriately answer a question based on the image. Most related works employ state-of-the-art object recognition systems to detect objects in an image correctly and achieve higher accuracy for VQA datasets. However, It is difficult for the object recognition systems to extract enough object position from the image because of their architectures. The systems employ max-pooling to select representative features in an area of the image and the max-pooling tends to introduce position ambiguity. To overcome the drawback, we construct a VQA system with Capsule Network, which can capture object position information without max-pooling. For experiments, we choose only yes/no type questions from VQA dataset and the proposed method improves approximately 4% accuracy for the whole questions. Especially, the proposed method improves approximately 15% accuracy for questions including \"next to\" and \"front of\"","PeriodicalId":156230,"journal":{"name":"2022 12th International Congress on Advanced Applied Informatics (IIAI-AAI)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Visual Question Answering Focusing on Object Positional Relation with Capsule Network\",\"authors\":\"H. Yanagimoto, Riki Nakatani, Kiyota Hashimoto\",\"doi\":\"10.1109/IIAIAAI55812.2022.00027\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents a visual question answering (VQA) system focusing on object positional relation, which consists of Capsule Network and a recurrent neural language model. Grasping object positions in an image is necessary to understand the image and appropriately answer a question based on the image. Most related works employ state-of-the-art object recognition systems to detect objects in an image correctly and achieve higher accuracy for VQA datasets. However, It is difficult for the object recognition systems to extract enough object position from the image because of their architectures. The systems employ max-pooling to select representative features in an area of the image and the max-pooling tends to introduce position ambiguity. To overcome the drawback, we construct a VQA system with Capsule Network, which can capture object position information without max-pooling. For experiments, we choose only yes/no type questions from VQA dataset and the proposed method improves approximately 4% accuracy for the whole questions. 
Especially, the proposed method improves approximately 15% accuracy for questions including \\\"next to\\\" and \\\"front of\\\"\",\"PeriodicalId\":156230,\"journal\":{\"name\":\"2022 12th International Congress on Advanced Applied Informatics (IIAI-AAI)\",\"volume\":\"63 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 12th International Congress on Advanced Applied Informatics (IIAI-AAI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IIAIAAI55812.2022.00027\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 12th International Congress on Advanced Applied Informatics (IIAI-AAI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IIAIAAI55812.2022.00027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Visual Question Answering Focusing on Object Positional Relation with Capsule Network
This paper presents a visual question answering (VQA) system that focuses on object positional relations and consists of a Capsule Network and a recurrent neural language model. Grasping object positions in an image is necessary to understand the image and answer a question about it appropriately. Most related works employ state-of-the-art object recognition systems to detect objects in an image correctly and achieve high accuracy on VQA datasets. However, it is difficult for such object recognition systems to extract sufficient object position information from an image because of their architectures: they employ max-pooling to select a representative feature in each region of the image, and max-pooling tends to introduce position ambiguity. To overcome this drawback, we construct a VQA system with a Capsule Network, which can capture object position information without max-pooling. For the experiments, we use only yes/no questions from the VQA dataset; the proposed method improves accuracy by approximately 4% over all questions, and notably by approximately 15% for questions including "next to" and "front of".
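The architectural point the abstract rests on, that max-pooling keeps only the strongest activation in a window and so blurs where that activation came from, while a capsule represents an entity as a vector whose orientation can encode pose, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; the squash nonlinearity below follows the original capsule formulation of Sabour et al. (2017), and the toy tensors are invented for illustration.

# Minimal sketch (assumption: PyTorch; not the paper's code) contrasting
# max-pooling, which discards position, with a capsule-style vector
# representation, which preserves it.
import torch
import torch.nn as nn

# Max-pooling: a 2x2 window keeps only the strongest activation.
# Two feature maps with the object in different corners pool to the same
# value, so downstream layers cannot tell the two positions apart.
pool = nn.MaxPool2d(kernel_size=2)
top_left  = torch.tensor([[[[1.0, 0.0], [0.0, 0.0]]]])  # activation at top-left
bot_right = torch.tensor([[[[0.0, 0.0], [0.0, 1.0]]]])  # activation at bottom-right
print(pool(top_left), pool(bot_right))  # both -> tensor([[[[1.]]]]): position lost

# Capsule-style alternative: an entity is a vector whose length encodes
# existence probability and whose direction encodes pose (e.g. position).
def squash(s: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Squashing nonlinearity from Sabour et al. (2017): shrinks vector
    length into [0, 1) while preserving its direction."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

# Two capsules for the same object at different positions keep distinct
# directions even though their lengths (confidences) are equal.
cap_a = squash(torch.tensor([2.0, 0.0]))  # pose pointing one way
cap_b = squash(torch.tensor([0.0, 2.0]))  # pose pointing another way
print(cap_a, cap_b)  # [0.8, 0.0] vs. [0.0, 0.8]: same confidence, different pose

In this toy setting the pooled outputs are identical while the squashed capsule vectors remain distinguishable, which is the property the abstract appeals to when arguing that a Capsule Network can answer positional-relation questions such as "next to" that max-pooling architectures blur.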