Similarity learning of product descriptions and images using multimodal neural networks
Kazim Ali Mazhar, Matthias Brodtbeck, Gabriele Gühring
Natural Language Processing Journal, Volume 4 (September 2023), Article 100029
DOI: 10.1016/j.nlp.2023.100029
https://www.sciencedirect.com/science/article/pii/S2949719123000262
Multimodal deep learning is an emerging research topic in machine learning and involves the parallel processing of different modalities of data, such as text, images, and audiovisual data. Well-known application areas are multimodal image and video processing as well as speech recognition. In this paper, we propose a multimodal neural network that measures the similarity of textual product descriptions and images, with applications in inventory reconciliation and search engine optimization. We develop two models. The first takes image and text data, each processed by convolutional neural networks, and combines the two modalities. The second is based on a bidirectional triplet loss function. We conduct experiments using the Amazon Berkeley Objects (ABO) dataset and an industry-related dataset used for the inventory reconciliation of a mechanical engineering company. Our first model achieves an accuracy of 92.37% with ResNet152 on the ABO dataset and 99.11% with MobileNetV3_Large on our industry-related dataset. By extending this model to three inputs, two text inputs and one image input, we greatly improve performance, achieving an accuracy of 97.57% on the ABO dataset and 99.83% on our industry-related inventory dataset. Our second model, based on the triplet loss, achieves an accuracy of only 73.85% on the ABO dataset. However, our experiments demonstrate that multimodal networks consistently perform better when measuring the similarity of products, even when one modality lacks sufficient data, because it is complemented by the other modality. Our proposed approaches open up several possibilities for further optimization of search engines.
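To make the first model's two-branch design concrete, the sketch below combines a CNN image branch with a 1-D convolutional text branch and fuses the two by concatenation before a match/no-match head. This is a minimal PyTorch sketch only: the layer sizes, the ResNet152 backbone configuration, the text encoder design, and the binary classification head are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a two-branch multimodal similarity model.
# Assumes torchvision >= 0.13 for the weights API; all dimensions
# and the match/no-match head are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class MultimodalSimilarityNet(nn.Module):
    def __init__(self, vocab_size: int = 10000, embed_dim: int = 128, fused_dim: int = 256):
        super().__init__()
        # Image branch: pretrained CNN with its classifier replaced by a projection.
        backbone = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        backbone.fc = nn.Linear(backbone.fc.in_features, fused_dim)
        self.image_branch = backbone
        # Text branch: token embeddings followed by a 1-D convolution over the sequence.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.text_conv = nn.Sequential(
            nn.Conv1d(embed_dim, fused_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        # Fusion: concatenate both modalities and score the pair.
        self.classifier = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.ReLU(),
            nn.Linear(fused_dim, 1),  # logit: does the description match the image?
        )

    def forward(self, image: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_branch(image)           # (B, fused_dim)
        txt = self.embedding(tokens).transpose(1, 2)  # (B, embed_dim, T)
        txt_feat = self.text_conv(txt).squeeze(-1)    # (B, fused_dim)
        return self.classifier(torch.cat([img_feat, txt_feat], dim=1))
```

The three-input extension described in the abstract would add a second text branch and concatenate three feature vectors instead of two; the same fusion pattern applies.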
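The second model's bidirectional triplet loss can be read as a standard margin-based triplet loss applied in both retrieval directions: image-to-text and text-to-image. The following is a hedged sketch under that assumption; the margin value, the default Euclidean distance, and the negative-sampling scheme are illustrative choices, not taken from the paper.

```python
# Hedged sketch of a bidirectional triplet loss: the margin-based
# ranking objective is applied with the image as anchor (text
# positive/negative) and again with the text as anchor (image
# positive/negative). Margin and distance are assumptions.
import torch
import torch.nn.functional as F

def bidirectional_triplet_loss(img_emb: torch.Tensor,
                               txt_emb: torch.Tensor,
                               neg_img_emb: torch.Tensor,
                               neg_txt_emb: torch.Tensor,
                               margin: float = 0.2) -> torch.Tensor:
    # Direction 1: image anchor, matching text as positive, other text as negative.
    img_to_txt = F.triplet_margin_loss(img_emb, txt_emb, neg_txt_emb, margin=margin)
    # Direction 2: text anchor, matching image as positive, other image as negative.
    txt_to_img = F.triplet_margin_loss(txt_emb, img_emb, neg_img_emb, margin=margin)
    return img_to_txt + txt_to_img
```

Training both directions pushes matching description-image pairs together in a shared embedding space while keeping mismatched pairs at least `margin` apart, regardless of which modality is used as the query.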