{"title":"Threefold Encoder Interaction: Hierarchical Multi-Grained Semantic Alignment for Cross-Modal Food Retrieval","authors":"Qi Wang;Dong Wang;Weidong Min;Di Gai;Qing Han;Cheng Zha;Yuling Zhong","doi":"10.1109/TMM.2025.3543067","DOIUrl":null,"url":null,"abstract":"Current cross-modal food retrieval approaches focus mainly on the global visual appearance of food without explicitly considering multi-grained information. Additionally, direct calculation of the global similarity of image-recipe pairs is not particularly effective in terms of latent alignment, which suffers from mismatch during the mutual image-recipe retrieval process. This paper proposes a threefold encoder interaction (TEI) cross-modal food retrieval framework to maintain the multi-granularity of food images and the multi-levels of textual recipes to address the aforementioned challenges. The TEI framework comprises an image encoder, a recipe encoder, and a multi-grained interaction encoder. We simultaneously propose a multi-grained relation-aware attention (MRA) embedded in the multi-grained interaction encoder to capture multi-grained food visual features. The multi-grained interaction similarity scores are calculated to better establish the multi-grained correlation between recipe and image entities based on the extracted hierarchical textual and multi-grained visual features. Finally, a hierarchical multi-grained semantic alignment loss is designed to supervise the whole process of cross-modal training using the multi-grained interaction similarity scores. Extensive qualitative and quantitative experiments on the Recipe1M dataset have demonstrated that the proposed TEI framework achieves multi-grained semantic alignment between image and text modalities and is superior to other state-of-the-art methods in cross-modal food retrieval tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2848-2862"},"PeriodicalIF":9.7000,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10891432/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Current cross-modal food retrieval approaches focus mainly on the global visual appearance of food without explicitly considering multi-grained information. Additionally, direct calculation of the global similarity of image-recipe pairs is not particularly effective in terms of latent alignment, which suffers from mismatch during the mutual image-recipe retrieval process. This paper proposes a threefold encoder interaction (TEI) cross-modal food retrieval framework to maintain the multi-granularity of food images and the multi-levels of textual recipes to address the aforementioned challenges. The TEI framework comprises an image encoder, a recipe encoder, and a multi-grained interaction encoder. We simultaneously propose a multi-grained relation-aware attention (MRA) embedded in the multi-grained interaction encoder to capture multi-grained food visual features. The multi-grained interaction similarity scores are calculated to better establish the multi-grained correlation between recipe and image entities based on the extracted hierarchical textual and multi-grained visual features. Finally, a hierarchical multi-grained semantic alignment loss is designed to supervise the whole process of cross-modal training using the multi-grained interaction similarity scores. Extensive qualitative and quantitative experiments on the Recipe1M dataset have demonstrated that the proposed TEI framework achieves multi-grained semantic alignment between image and text modalities and is superior to other state-of-the-art methods in cross-modal food retrieval tasks.
期刊介绍:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.