Similarity learning of product descriptions and images using multimodal neural networks

Kazim Ali Mazhar, Matthias Brodtbeck, Gabriele Gühring
{"title":"使用多模式神经网络的产品描述和图像的相似性学习","authors":"Kazim Ali Mazhar ,&nbsp;Matthias Brodtbeck ,&nbsp;Gabriele Gühring","doi":"10.1016/j.nlp.2023.100029","DOIUrl":null,"url":null,"abstract":"<div><p>Multimodal deep learning is an emerging research topic in machine learning and involves the parallel processing of different modalities of data such as texts, images and audiovisual data. Well-known application areas are multimodal image and video processing as well as speech recognition. In this paper, we propose a multimodal neural network that measures the similarity of text-written product descriptions and images and has applications in inventory reconciliation and search engine optimization. We develop two models. The first takes image and text data, each processed by convolutional neural networks, and combines the two modalities. The second is based on a bidirectional triplet loss function. We conduct experiments using <strong>ABO!</strong> (<strong>ABO!</strong>) dataset and an industry-related dataset used for the inventory reconciliation of a mechanical engineering company. Our first model achieves an accuracy of 92.37% with ResNet152 on the <strong>ABO!</strong> dataset and 99.11% with MobileNetV3_Large on our industry-related dataset. By extending this model to a model with three inputs, two text inputs and one image input, we greatly improve the performance and achieve an accuracy of 97.57% on the <strong>ABO!</strong> dataset and 99.83% with our industry related inventory dataset. Our second model based on the triplet loss achieves only an accuracy of 73.85% on the <strong>ABO!</strong> dataset. However, our experiments demonstrate that multimodal networks consistently perform better when measuring the similarity of products, even in situations where one modality lacks sufficient data, because it is complemented with the other modality. Our proposed approaches open up several possibilities for further optimization of search engines.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"4 ","pages":"Article 100029"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Similarity learning of product descriptions and images using multimodal neural networks\",\"authors\":\"Kazim Ali Mazhar ,&nbsp;Matthias Brodtbeck ,&nbsp;Gabriele Gühring\",\"doi\":\"10.1016/j.nlp.2023.100029\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Multimodal deep learning is an emerging research topic in machine learning and involves the parallel processing of different modalities of data such as texts, images and audiovisual data. Well-known application areas are multimodal image and video processing as well as speech recognition. In this paper, we propose a multimodal neural network that measures the similarity of text-written product descriptions and images and has applications in inventory reconciliation and search engine optimization. We develop two models. The first takes image and text data, each processed by convolutional neural networks, and combines the two modalities. The second is based on a bidirectional triplet loss function. We conduct experiments using <strong>ABO!</strong> (<strong>ABO!</strong>) dataset and an industry-related dataset used for the inventory reconciliation of a mechanical engineering company. 
Our first model achieves an accuracy of 92.37% with ResNet152 on the <strong>ABO!</strong> dataset and 99.11% with MobileNetV3_Large on our industry-related dataset. By extending this model to a model with three inputs, two text inputs and one image input, we greatly improve the performance and achieve an accuracy of 97.57% on the <strong>ABO!</strong> dataset and 99.83% with our industry related inventory dataset. Our second model based on the triplet loss achieves only an accuracy of 73.85% on the <strong>ABO!</strong> dataset. However, our experiments demonstrate that multimodal networks consistently perform better when measuring the similarity of products, even in situations where one modality lacks sufficient data, because it is complemented with the other modality. Our proposed approaches open up several possibilities for further optimization of search engines.</p></div>\",\"PeriodicalId\":100944,\"journal\":{\"name\":\"Natural Language Processing Journal\",\"volume\":\"4 \",\"pages\":\"Article 100029\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Natural Language Processing Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2949719123000262\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719123000262","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Multimodal deep learning is an emerging research topic in machine learning that involves the parallel processing of different modalities of data such as text, images and audiovisual data. Well-known application areas include multimodal image and video processing as well as speech recognition. In this paper, we propose a multimodal neural network that measures the similarity of textual product descriptions and images, with applications in inventory reconciliation and search engine optimization. We develop two models. The first takes image and text data, each processed by convolutional neural networks, and combines the two modalities. The second is based on a bidirectional triplet loss function. We conduct experiments on the Amazon Berkeley Objects (ABO) dataset and on an industry-related dataset used for the inventory reconciliation of a mechanical engineering company. Our first model achieves an accuracy of 92.37% with ResNet152 on the ABO dataset and 99.11% with MobileNetV3_Large on our industry-related dataset. By extending this model to one with three inputs, two text inputs and one image input, we greatly improve performance and achieve an accuracy of 97.57% on the ABO dataset and 99.83% on our industry-related inventory dataset. Our second model, based on the triplet loss, achieves an accuracy of only 73.85% on the ABO dataset. However, our experiments demonstrate that multimodal networks consistently perform better when measuring the similarity of products, even in situations where one modality lacks sufficient data, because it is complemented by the other modality. Our proposed approaches open up several possibilities for further optimization of search engines.
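The abstract describes the two architectures only at a high level. As a rough illustration, the following is a minimal PyTorch sketch of both ideas: a two-input fusion model that concatenates CNN image features with CNN text features and scores the pair, and a bidirectional triplet-style loss computed over in-batch negatives. All layer sizes, module names (TextCNN, FusionMatcher, bidirectional_triplet_loss) and the exact loss construction are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the paper's two approaches; module sizes, names
# and the fusion strategy are illustrative assumptions, not the exact
# architecture from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet152, ResNet152_Weights


class TextCNN(nn.Module):
    """1-D convolutional text encoder over token embeddings (assumed design)."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, out_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, out_dim, kernel_size=3, padding=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens).transpose(1, 2)        # (B, embed_dim, seq_len)
        x = F.relu(self.conv(x))
        return x.max(dim=2).values                    # global max-pool -> (B, out_dim)


class FusionMatcher(nn.Module):
    """Model 1 idea: CNN image branch + CNN text branch, fused for a match score."""
    def __init__(self, vocab_size: int):
        super().__init__()
        backbone = resnet152(weights=ResNet152_Weights.DEFAULT)
        backbone.fc = nn.Identity()                   # expose 2048-d image features
        self.image_branch = backbone
        self.text_branch = TextCNN(vocab_size)
        self.head = nn.Sequential(
            nn.Linear(2048 + 256, 512), nn.ReLU(),
            nn.Linear(512, 1),                        # similarity logit
        )

    def forward(self, image: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.image_branch(image), self.text_branch(tokens)], dim=1)
        return self.head(fused).squeeze(1)


def bidirectional_triplet_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                               margin: float = 0.2) -> torch.Tensor:
    """Model 2 idea: a triplet-style margin loss applied in both directions
    (image->text and text->image) with in-batch negatives. Assumes both
    embeddings are L2-normalized; the paper's exact variant may differ."""
    sim = img_emb @ txt_emb.t()                       # (B, B) cosine scores
    pos = sim.diag().unsqueeze(1)                     # matching pairs on the diagonal
    cost_i2t = (margin + sim - pos).clamp(min=0)      # rank texts against each image
    cost_t2i = (margin + sim.t() - pos).clamp(min=0)  # rank images against each text
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_i2t.masked_fill(mask, 0).mean() + cost_t2i.masked_fill(mask, 0).mean()
```

In a retrieval setting, the fusion model would be trained on matching and non-matching description-image pairs with a binary cross-entropy loss, while the triplet variant learns a shared embedding space in which matching pairs sit closer than mismatched ones.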
