Human-Centric Image Retrieval with Gaze-Based Image Captioning

Yuhu Feng, Keisuke Maeda, Takahiro Ogawa, M. Haseyama
{"title":"基于注视的图像字幕的以人为中心的图像检索","authors":"Yuhu Feng, Keisuke Maeda, Takahiro Ogawa, M. Haseyama","doi":"10.1109/ICIP46576.2022.9897949","DOIUrl":null,"url":null,"abstract":"This paper presents human-centric image retrieval with gaze-based image captioning. Although the development of cross-modal embedding techniques has enabled advanced image retrieval, many methods have focused only on the information obtained from the contents such as image and text. For further extending the image retrieval, it is necessary to construct retrieval techniques that directly reflect human intentions. In this paper, we propose a new retrieval approach via image captioning based on gaze information by focusing on the fact that the gaze information obtained from humans contains semantic information. Specifically, we construct a transformer, connect caption and gaze trace (CGT) model that learns the relationship among images, captioning provided by humans and gaze traces. Our CGT model enables transformer-based learning by dividing the gaze traces into several bounding boxes, and thus, gaze-based image captioning becomes feasible. By using the obtained captioning for cross-modal retrieval, we can achieve human-centric image retrieval. The technical contribution of this paper is transforming the gaze trace into the captioning via the transformer-based encoder. In the experiments, by comparing the cross-modal embedding method, the effectiveness of the proposed method is proved.","PeriodicalId":387035,"journal":{"name":"2022 IEEE International Conference on Image Processing (ICIP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Human-Centric Image Retrieval with Gaze-Based Image Captioning\",\"authors\":\"Yuhu Feng, Keisuke Maeda, Takahiro Ogawa, M. Haseyama\",\"doi\":\"10.1109/ICIP46576.2022.9897949\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents human-centric image retrieval with gaze-based image captioning. Although the development of cross-modal embedding techniques has enabled advanced image retrieval, many methods have focused only on the information obtained from the contents such as image and text. For further extending the image retrieval, it is necessary to construct retrieval techniques that directly reflect human intentions. In this paper, we propose a new retrieval approach via image captioning based on gaze information by focusing on the fact that the gaze information obtained from humans contains semantic information. Specifically, we construct a transformer, connect caption and gaze trace (CGT) model that learns the relationship among images, captioning provided by humans and gaze traces. Our CGT model enables transformer-based learning by dividing the gaze traces into several bounding boxes, and thus, gaze-based image captioning becomes feasible. By using the obtained captioning for cross-modal retrieval, we can achieve human-centric image retrieval. The technical contribution of this paper is transforming the gaze trace into the captioning via the transformer-based encoder. 
In the experiments, by comparing the cross-modal embedding method, the effectiveness of the proposed method is proved.\",\"PeriodicalId\":387035,\"journal\":{\"name\":\"2022 IEEE International Conference on Image Processing (ICIP)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE International Conference on Image Processing (ICIP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICIP46576.2022.9897949\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Image Processing (ICIP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICIP46576.2022.9897949","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

This paper presents human-centric image retrieval with gaze-based image captioning. Although the development of cross-modal embedding techniques has enabled advanced image retrieval, many methods have focused only on information obtained from content such as images and text. To extend image retrieval further, it is necessary to construct retrieval techniques that directly reflect human intentions. In this paper, we propose a new retrieval approach via image captioning based on gaze information, focusing on the fact that gaze information obtained from humans contains semantic information. Specifically, we construct a transformer-based connect caption and gaze trace (CGT) model that learns the relationships among images, captions provided by humans, and gaze traces. Our CGT model enables transformer-based learning by dividing gaze traces into several bounding boxes, which makes gaze-based image captioning feasible. By using the obtained captions for cross-modal retrieval, we achieve human-centric image retrieval. The technical contribution of this paper is the transformation of gaze traces into captions via a transformer-based encoder. In the experiments, the effectiveness of the proposed method is demonstrated through comparison with a cross-modal embedding method.
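
To make the abstract's core mechanism concrete, here is a minimal sketch of its central step: dividing a gaze trace into bounding boxes and encoding them with a transformer so that a caption decoder could attend to them. The paper does not publish its implementation, so everything below (the names trace_to_boxes and GazeTraceEncoder, the 0.1 box size, and the model dimensions) is a hypothetical illustration of the idea, not the authors' actual CGT code.

```python
# Hypothetical sketch of "gaze trace -> bounding boxes -> transformer tokens".
# All names and hyperparameters here are illustrative assumptions, not the
# authors' published CGT implementation.
import torch
import torch.nn as nn


def trace_to_boxes(trace: torch.Tensor, box_size: float = 0.1) -> torch.Tensor:
    """Divide a gaze trace into square boxes centered on each fixation.

    trace: (N, 2) normalized (x, y) fixation coordinates in [0, 1].
    Returns (N, 4) boxes as (x_min, y_min, x_max, y_max), clamped to the image.
    """
    half = box_size / 2.0
    boxes = torch.cat([trace - half, trace + half], dim=-1)
    return boxes.clamp(0.0, 1.0)


class GazeTraceEncoder(nn.Module):
    """Transformer encoder over gaze-derived box embeddings (hypothetical)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # Embed each (x_min, y_min, x_max, y_max) box as one input token.
        self.box_embed = nn.Linear(4, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (B, N, 4) -> contextualized gaze tokens of shape (B, N, d_model).
        return self.encoder(self.box_embed(boxes))


# Toy usage: a 12-fixation trace becomes 12 boxes, then 12 gaze tokens.
trace = torch.rand(12, 2)
boxes = trace_to_boxes(trace)             # (12, 4)
tokens = GazeTraceEncoder()(boxes[None])  # (1, 12, 256)
```

In a full CGT-style pipeline, such gaze tokens would presumably be combined with image features and fed to a caption decoder; the generated caption then serves as the text query for standard cross-modal (text-to-image) retrieval.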