Sentence-based and Noise-robust Cross-modal Retrieval on Cooking Recipes and Food Images

Proceedings of the 2020 International Conference on Multimedia Retrieval. Pub Date: 2020-06-08. DOI: 10.1145/3372278.3390681

Zichen Zan, Lin Li, Jianquan Liu, D. Zhou


Citations: 16

Abstract

In recent years, people have been confronted with billions of food images, videos, and recipes on social media. A suitable technology, such as a cross-modal retrieval framework, is highly desirable for retrieving accurate content across food images and cooking recipes. Based on our observations, the order of sequential sentences in recipes and the noise in food images affect retrieval results. We take into account the sentence-level sequential order of instructions and ingredients in recipes, as well as the noisy portions of food images, to propose a new framework for cross-modal retrieval. Within this framework, we propose three new strategies to improve retrieval accuracy. (1) We encode recipe titles, ingredients, and instructions at the sentence level, and apply three attention networks separately to multi-layer hidden-state features to capture more semantic information. (2) We apply an attention mechanism, incorporating recipe embeddings, to select effective features from food images, and adopt an adversarial learning strategy to enhance modality alignment. (3) We design a new triplet loss scheme with an effective sampling strategy to reduce the impact of noise on retrieval results. The experimental results show that our framework clearly outperforms state-of-the-art methods in terms of median rank and recall rate at top k on the Recipe1M dataset.
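The evaluation metrics named in the abstract (median rank and recall at top k) and the general idea of a triplet loss with negative sampling can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify their loss or sampling scheme, so the hardest-in-batch negative selection, the cosine similarity, the margin value of 0.3, and all function names and toy embeddings below are assumptions chosen for illustration only.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks_of_matches(image_embs, recipe_embs):
    """For each image query i, return the 1-based rank of recipe i
    when all recipes are sorted by similarity to the image."""
    ranks = []
    for i, query in enumerate(image_embs):
        sims = [cosine(query, r) for r in recipe_embs]
        # Rank is 1 plus the number of recipes scoring strictly higher
        # than the true match.
        ranks.append(1 + sum(1 for s in sims if s > sims[i]))
    return ranks


def median_rank(ranks):
    """Median position of the true match (lower is better)."""
    return median(ranks)


def recall_at_k(ranks, k):
    """Fraction of queries whose true match appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def triplet_loss_hardest_negative(image_embs, recipe_embs, margin=0.3):
    """Generic in-batch triplet loss (an assumed stand-in, not the
    paper's scheme): the anchor is an image, the positive is its paired
    recipe, and the negative is the most similar non-matching recipe."""
    total = 0.0
    for i, anchor in enumerate(image_embs):
        pos = cosine(anchor, recipe_embs[i])
        hardest_neg = max(
            cosine(anchor, r) for j, r in enumerate(recipe_embs) if j != i
        )
        total += max(0.0, margin - pos + hardest_neg)
    return total / len(image_embs)


# Hypothetical toy embeddings: image i is paired with recipe i.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
recipes = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]

ranks = ranks_of_matches(images, recipes)  # [1, 2, 1]
print("ranks:", ranks)
print("median rank:", median_rank(ranks))
print("recall@1:", recall_at_k(ranks, 1))
print("recall@2:", recall_at_k(ranks, 2))
print("triplet loss:", triplet_loss_hardest_negative(images, recipes))
```

In the toy data, the second image is closer to the third recipe than to its own, so its true match lands at rank 2 and that pair contributes the largest term to the loss. A real system would compute these quantities on learned embeddings over a held-out subset of the dataset.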



