Threefold Encoder Interaction: Hierarchical Multi-Grained Semantic Alignment for Cross-Modal Food Retrieval

IF 9.7 | CAS Tier 1, Computer Science | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS
Qi Wang;Dong Wang;Weidong Min;Di Gai;Qing Han;Cheng Zha;Yuling Zhong
{"title":"Threefold Encoder Interaction: Hierarchical Multi-Grained Semantic Alignment for Cross-Modal Food Retrieval","authors":"Qi Wang;Dong Wang;Weidong Min;Di Gai;Qing Han;Cheng Zha;Yuling Zhong","doi":"10.1109/TMM.2025.3543067","DOIUrl":null,"url":null,"abstract":"Current cross-modal food retrieval approaches focus mainly on the global visual appearance of food without explicitly considering multi-grained information. Additionally, direct calculation of the global similarity of image-recipe pairs is not particularly effective in terms of latent alignment, which suffers from mismatch during the mutual image-recipe retrieval process. This paper proposes a threefold encoder interaction (TEI) cross-modal food retrieval framework to maintain the multi-granularity of food images and the multi-levels of textual recipes to address the aforementioned challenges. The TEI framework comprises an image encoder, a recipe encoder, and a multi-grained interaction encoder. We simultaneously propose a multi-grained relation-aware attention (MRA) embedded in the multi-grained interaction encoder to capture multi-grained food visual features. The multi-grained interaction similarity scores are calculated to better establish the multi-grained correlation between recipe and image entities based on the extracted hierarchical textual and multi-grained visual features. Finally, a hierarchical multi-grained semantic alignment loss is designed to supervise the whole process of cross-modal training using the multi-grained interaction similarity scores. 
Extensive qualitative and quantitative experiments on the Recipe1M dataset have demonstrated that the proposed TEI framework achieves multi-grained semantic alignment between image and text modalities and is superior to other state-of-the-art methods in cross-modal food retrieval tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2848-2862"},"PeriodicalIF":9.7000,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10891432/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Current cross-modal food retrieval approaches focus mainly on the global visual appearance of food without explicitly considering multi-grained information. Moreover, directly computing the global similarity of image-recipe pairs is ineffective for latent alignment and leads to mismatches during mutual image-recipe retrieval. To address these challenges, this paper proposes a threefold encoder interaction (TEI) cross-modal food retrieval framework that preserves the multiple granularities of food images and the multiple levels of textual recipes. The TEI framework comprises an image encoder, a recipe encoder, and a multi-grained interaction encoder. We further propose a multi-grained relation-aware attention (MRA) module, embedded in the multi-grained interaction encoder, to capture multi-grained visual food features. Multi-grained interaction similarity scores are then computed from the extracted hierarchical textual and multi-grained visual features to better establish correlations between recipe and image entities. Finally, a hierarchical multi-grained semantic alignment loss is designed to supervise the entire cross-modal training process using these similarity scores. Extensive qualitative and quantitative experiments on the Recipe1M dataset demonstrate that the proposed TEI framework achieves multi-grained semantic alignment between the image and text modalities and outperforms other state-of-the-art methods on cross-modal food retrieval tasks.
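The general idea of scoring image-recipe pairs at several granularities and supervising with a margin-based alignment loss can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature extractors, the specific granularities (global/regional/local images versus title/ingredients/instructions text), the weighting scheme, and the exact form of the hierarchical loss are all assumptions for the sake of the example.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def interaction_similarity(image_grains, recipe_levels, weights):
    """Weighted sum of per-granularity cosine similarities between
    image features (e.g. global / regional / local) and recipe
    features (e.g. title / ingredients / instructions).

    Each argument is a list of same-length feature vectors, one per
    granularity; `weights` controls each granularity's contribution."""
    return sum(w * cosine(img, txt)
               for w, img, txt in zip(weights, image_grains, recipe_levels))

def triplet_alignment_loss(sim_pos, sim_neg, margin=0.3):
    # Hinge-style margin loss: the matching image-recipe pair should
    # score higher than a mismatched pair by at least `margin`.
    return max(0.0, margin - sim_pos + sim_neg)
```

A matching pair whose features agree at every granularity scores close to 1 and incurs zero loss against a low-scoring negative, while a pair that only agrees globally is pulled up at the finer granularities as well.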
Source Journal

IEEE Transactions on Multimedia (Engineering & Technology - Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Articles per year: 576
Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.