Yuhu Feng, Keisuke Maeda, Takahiro Ogawa, M. Haseyama
2022 IEEE International Conference on Image Processing (ICIP), published 2022-10-16. DOI: https://doi.org/10.1109/ICIP46576.2022.9897949
Human-Centric Image Retrieval with Gaze-Based Image Captioning
This paper presents human-centric image retrieval with gaze-based image captioning. Although advances in cross-modal embedding techniques have enabled sophisticated image retrieval, many methods rely only on information obtained from the contents themselves, such as images and text. To extend image retrieval further, retrieval techniques that directly reflect human intentions are needed. We propose a new retrieval approach via image captioning based on gaze information, motivated by the fact that gaze information obtained from humans contains semantic information. Specifically, we construct a transformer-based connect caption and gaze trace (CGT) model that learns the relationships among images, captions provided by humans, and gaze traces. The CGT model enables transformer-based learning by dividing the gaze traces into several bounding boxes, which makes gaze-based image captioning feasible. Using the generated captions for cross-modal retrieval, we achieve human-centric image retrieval. The technical contribution of this paper is the transformation of gaze traces into captions via a transformer-based encoder. In the experiments, a comparison with a cross-modal embedding method demonstrates the effectiveness of the proposed method.
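The abstract says the gaze traces are divided into several bounding boxes but does not say how. As a rough illustration only, the sketch below assumes the simplest possible scheme: split the trace into equal-length consecutive chunks and take the axis-aligned box around each chunk. The function name `trace_to_boxes`, the `num_boxes` parameter, and the normalized `(x, y)` point format are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]              # assumed (x, y) gaze sample in image coordinates
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def trace_to_boxes(trace: List[Point], num_boxes: int) -> List[Box]:
    """Split a gaze trace into consecutive chunks and box each chunk.

    Assumption: equal-length temporal chunks. The paper may segment
    the trace differently (for example, per caption word or fixation).
    """
    if num_boxes <= 0:
        raise ValueError("num_boxes must be positive")
    if not trace:
        return []
    # Never create more chunks than there are gaze samples.
    num_boxes = min(num_boxes, len(trace))
    boxes: List[Box] = []
    for i in range(num_boxes):
        start = i * len(trace) // num_boxes
        end = (i + 1) * len(trace) // num_boxes
        chunk = trace[start:end]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        # Tightest axis-aligned box around the samples in this chunk.
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


if __name__ == "__main__":
    demo_trace = [(0.1, 0.2), (0.3, 0.1), (0.5, 0.6), (0.7, 0.4)]
    for box in trace_to_boxes(demo_trace, num_boxes=2):
        print(box)
```

Each resulting box could then be treated as one token-like region for a transformer encoder, which is one plausible reading of how a continuous trace becomes a discrete input sequence.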