Deep attentive multimodal learning for food information enhancement via early-stage heterogeneous fusion

The Visual Computer Pub Date : 2024-07-03 DOI:10.1007/s00371-024-03546-5

Avantika Saklani, Shailendra Tiwari, H. S. Pannu


Citations: 0

Abstract

Compared with single-modal content, multimodal data can offer greater insight into food statistics, more vividly and effectively. Traditional food classification systems, however, focus on a single modality. This is inadequate given the massive amount of data emerging every day, a trend that has recently attracted researchers to this field. Moreover, very few multimodal Indian food datasets are available. Motivated by these observations, we build a novel multimodal food analysis model based on a deep attentive multimodal fusion network (DAMFN) that integrates linguistic and visual information. The model has three stages: functional feature extraction, early-stage fusion, and feature classification. In functional feature extraction, deep features are extracted from each modality. An early-stage fusion is then applied that leverages the deep correlation between the modalities. Finally, in the feature classification stage, the fused features are passed to the classification system for the final decision. We also developed a dataset of Indian food images with their associated captions for the experiments. In addition, the proposed approach is evaluated on a large-scale dataset, UPMC Food 101, which has 90,704 instances. The experimental results show that the proposed DAMFN outperforms several state-of-the-art multimodal food classification methods as well as single-modality systems.
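The three stages named in the abstract (feature extraction, early-stage fusion, classification) can be illustrated with a minimal sketch. This is not the authors' DAMFN: the abstract does not specify the architecture, so the attention gate (a softmax over two hand-supplied modality scores), the concatenation-based fusion, the linear classifier, and all names and numbers below are assumptions chosen only to show where early fusion sits in such a pipeline.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_early_fusion(image_feat, text_feat, modality_scores):
    """Scale each modality's features by an attention weight, then concatenate.

    modality_scores holds two hypothetical relevance scores (image, text).
    In a trained model these would be learned; here they are passed in by hand.
    """
    w_img, w_txt = softmax(modality_scores)
    return [w_img * x for x in image_feat] + [w_txt * x for x in text_feat]


def classify(fused, weights, biases):
    """Linear layer followed by softmax; returns (class index, probabilities)."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    return probs.index(max(probs)), probs


# Stage 1 (assumed): toy vectors standing in for deep image and caption features.
image_feat = [0.9, 0.1, 0.4]
text_feat = [0.2, 0.8]

# Stage 2: early fusion happens before any classification decision is made.
fused = attentive_early_fusion(image_feat, text_feat, [0.0, 0.0])

# Stage 3: a two-class linear classifier with made-up weights.
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
biases = [0.0, 0.0]
label, probs = classify(fused, weights, biases)
```

The point of the sketch is the ordering: the two modalities are combined into one vector first, and a single classifier then sees the joint representation, in contrast to late fusion, where each modality would be classified separately and the decisions merged afterwards.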


Source journal

The Visual Computer

Self-citation rate

0.00%

Number of published articles