Threefold Encoder Interaction: Hierarchical Multi-Grained Semantic Alignment for Cross-Modal Food Retrieval

IF 9.7 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2025-02-18 DOI:10.1109/TMM.2025.3543067

Qi Wang;Dong Wang;Weidong Min;Di Gai;Qing Han;Cheng Zha;Yuling Zhong

IEEE Transactions on Multimedia, vol. 27, pp. 2848-2862. https://ieeexplore.ieee.org/document/10891432/

Citations: 0

Abstract

Current cross-modal food retrieval approaches focus mainly on the global visual appearance of food and do not explicitly consider multi-grained information. In addition, directly computing a global similarity for image-recipe pairs is not particularly effective for latent alignment, which leads to mismatches in both image-to-recipe and recipe-to-image retrieval. To address these challenges, this paper proposes a threefold encoder interaction (TEI) framework for cross-modal food retrieval that preserves both the multiple granularities of food images and the multiple levels of textual recipes. The TEI framework comprises an image encoder, a recipe encoder, and a multi-grained interaction encoder. A multi-grained relation-aware attention (MRA) mechanism, embedded in the multi-grained interaction encoder, captures multi-grained visual features of food. From the extracted hierarchical textual features and multi-grained visual features, multi-grained interaction similarity scores are computed to better establish the multi-grained correlation between recipe and image entities. Finally, a hierarchical multi-grained semantic alignment loss uses these similarity scores to supervise the whole cross-modal training process. Extensive qualitative and quantitative experiments on the Recipe1M dataset demonstrate that the proposed TEI framework achieves multi-grained semantic alignment between the image and text modalities and outperforms other state-of-the-art methods on cross-modal food retrieval tasks.
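The abstract gives no formulas, so the following is only a rough sketch of how a multi-grained similarity score and an alignment loss of this general kind might be wired together. Everything specific here is an assumption for illustration, not the TEI method: the level names (`title`, `ingredients`), the granularity names (`global`, `regional`), the max-then-mean aggregation, the margin of 0.3, and the use of a bidirectional hinge triplet loss are all hypothetical stand-ins.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def multi_grained_similarity(recipe_levels, image_grains):
    """Mean, over recipe levels, of the best-matching image granularity.

    recipe_levels: dict of level name -> vector (names are hypothetical).
    image_grains:  dict of granularity name -> vector (names are hypothetical).
    The max-then-mean aggregation is an assumption, not taken from the paper.
    """
    best = [
        max(cosine(text_vec, img_vec) for img_vec in image_grains.values())
        for text_vec in recipe_levels.values()
    ]
    return sum(best) / len(best)


def bidirectional_triplet_loss(sim, margin=0.3):
    """Hinge triplet loss over a square matrix sim[i][j] (recipe i vs image j).

    Matched pairs sit on the diagonal; every off-diagonal entry acts as a
    negative, once with the recipe as anchor and once with the image as anchor.
    """
    n = len(sim)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            total += max(0.0, margin - sim[i][i] + sim[i][j])  # recipe i as anchor
            total += max(0.0, margin - sim[i][i] + sim[j][i])  # image i as anchor
    return total / (2 * n * (n - 1))


# Toy batch of two recipe-image pairs with 2-dimensional embeddings.
recipes = [
    {"title": [1.0, 0.0], "ingredients": [0.9, 0.1]},
    {"title": [0.0, 1.0], "ingredients": [0.1, 0.9]},
]
images = [
    {"global": [1.0, 0.0], "regional": [0.8, 0.2]},
    {"global": [0.0, 1.0], "regional": [0.2, 0.8]},
]
sim = [[multi_grained_similarity(r, im) for im in images] for r in recipes]
loss = bidirectional_triplet_loss(sim)
```

In this toy batch the matched pairs score well above the mismatched ones, so the hinge terms stay inactive. In a real system the vectors would come from trained encoders, and the similarity would be computed on tensors so that the loss can be backpropagated.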

Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70
Self-citation rate: 11.00%
Articles published: 576
Review time: 5.5 months

About the journal: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.