{"title":"学习从描述性文本和图像中推断产品属性值","authors":"Pablo Montalvo, Aghiles Salah","doi":"10.1145/3539597.3575786","DOIUrl":null,"url":null,"abstract":"Online marketplaces are able to offer a staggering array of products that no physical store can match. While this makes it more likely for customers to find what they want, in order for online providers to ensure a smooth and efficient user experience, they must maintain well-organized catalogs, which depends greatly on the availability of per-product attribute values such as color, material, brand, to name a few. Unfortunately, such information is often incomplete or even missing in practice, and therefore we have to resort to predictive models as well as other sources of information to impute missing attribute values. In this talk we present the deep learning-based approach that we have developed at Rakuten Group to extract attribute values from product descriptive texts and images. Starting from pretrained architectures to encode textual and visual modalities, we discuss several refinements and improvements that we find necessary to achieve satisfactory performance and meet strict business requirements, namely improving recall while maintaining a high precision (>= 95%). Our methodology is driven by a systematic investigation into several practical research questions surrounding multimodality, which we revisit in this talk. At the heart of our multimodal architecture, is a new method to combine modalities inspired by empirical cross-modality comparisons. We present the latter component in details, point out one of its major limitations, namely exacerbating the issue of modality collapse, i.e., when the model forgets one modality, and describe our mitigation to this problem based on a principled regularization scheme. We present various empirical results on both Rakuten data as well as public benchmark datasets, which provide evidence of the benefits of our approach compared to several strong baselines. We also share some insights to characterise the circumstances in which the proposed model offers the most significant improvements. We conclude this talk by criticising the current model and discussing possible future developments and improvements. Our model is successfully deployed in Rakuten Ichiba - a Rakuten marketplace - and we believe that our investigation into multimodal attribute value extraction for e-commerce will benefit other researchers and practitioners alike embarking on similar journeys.","PeriodicalId":227804,"journal":{"name":"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining","volume":"202 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Learning to Infer Product Attribute Values From Descriptive Texts and Images\",\"authors\":\"Pablo Montalvo, Aghiles Salah\",\"doi\":\"10.1145/3539597.3575786\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Online marketplaces are able to offer a staggering array of products that no physical store can match. While this makes it more likely for customers to find what they want, in order for online providers to ensure a smooth and efficient user experience, they must maintain well-organized catalogs, which depends greatly on the availability of per-product attribute values such as color, material, brand, to name a few. 
Unfortunately, such information is often incomplete or even missing in practice, and therefore we have to resort to predictive models as well as other sources of information to impute missing attribute values. In this talk we present the deep learning-based approach that we have developed at Rakuten Group to extract attribute values from product descriptive texts and images. Starting from pretrained architectures to encode textual and visual modalities, we discuss several refinements and improvements that we find necessary to achieve satisfactory performance and meet strict business requirements, namely improving recall while maintaining a high precision (>= 95%). Our methodology is driven by a systematic investigation into several practical research questions surrounding multimodality, which we revisit in this talk. At the heart of our multimodal architecture, is a new method to combine modalities inspired by empirical cross-modality comparisons. We present the latter component in details, point out one of its major limitations, namely exacerbating the issue of modality collapse, i.e., when the model forgets one modality, and describe our mitigation to this problem based on a principled regularization scheme. We present various empirical results on both Rakuten data as well as public benchmark datasets, which provide evidence of the benefits of our approach compared to several strong baselines. We also share some insights to characterise the circumstances in which the proposed model offers the most significant improvements. We conclude this talk by criticising the current model and discussing possible future developments and improvements. Our model is successfully deployed in Rakuten Ichiba - a Rakuten marketplace - and we believe that our investigation into multimodal attribute value extraction for e-commerce will benefit other researchers and practitioners alike embarking on similar journeys.\",\"PeriodicalId\":227804,\"journal\":{\"name\":\"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining\",\"volume\":\"202 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-02-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3539597.3575786\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3539597.3575786","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Learning to Infer Product Attribute Values From Descriptive Texts and Images
Online marketplaces offer a staggering array of products that no physical store can match. While this makes it more likely for customers to find what they want, online providers must maintain well-organized catalogs to ensure a smooth and efficient user experience, which in turn depends greatly on the availability of per-product attribute values such as color, material, and brand. Unfortunately, such information is often incomplete or even missing in practice, so we have to resort to predictive models, as well as other sources of information, to impute missing attribute values. In this talk we present the deep-learning-based approach that we have developed at Rakuten Group to extract attribute values from product descriptive texts and images. Starting from pretrained architectures that encode the textual and visual modalities, we discuss several refinements and improvements that we found necessary to achieve satisfactory performance and meet strict business requirements, namely improving recall while maintaining high precision (>= 95%). Our methodology is driven by a systematic investigation into several practical research questions surrounding multimodality, which we revisit in this talk. At the heart of our multimodal architecture is a new method for combining modalities, inspired by empirical cross-modality comparisons. We present this component in detail, point out one of its major limitations, namely that it exacerbates modality collapse (the model forgetting one modality), and describe our mitigation of this problem based on a principled regularization scheme. We present various empirical results on both Rakuten data and public benchmark datasets, which provide evidence of the benefits of our approach over several strong baselines. We also share some insights that characterise the circumstances in which the proposed model offers the most significant improvements. We conclude this talk with a critical discussion of the current model and possible future developments and improvements. Our model is successfully deployed in Rakuten Ichiba, a Rakuten marketplace, and we believe that our investigation into multimodal attribute value extraction for e-commerce will benefit other researchers and practitioners embarking on similar journeys.
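The abstract does not spell out the authors' architecture or their regularization scheme, so the sketch below is only a generic illustration of the ingredients it mentions: pretrained per-modality encoders, a learned combination of the text and image representations, and an auxiliary-loss regularizer, which is one common way to discourage modality collapse (it is not necessarily the principled scheme described in the talk). All class names, dimensions, and the aux_weight parameter are hypothetical.

```python
# Minimal, illustrative sketch (not the authors' actual system): a text+image
# fusion model for attribute-value classification, with auxiliary unimodal
# losses as a stand-in regularizer against modality collapse.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttributeValueClassifier(nn.Module):
    def __init__(self, text_encoder, image_encoder, text_dim, image_dim,
                 hidden_dim, num_values):
        super().__init__()
        self.text_encoder = text_encoder    # stand-in for a pretrained text backbone
        self.image_encoder = image_encoder  # stand-in for a pretrained vision backbone
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Gate deciding, per example, how much weight each modality receives.
        self.gate = nn.Linear(2 * hidden_dim, 1)
        self.head = nn.Linear(hidden_dim, num_values)
        # Auxiliary unimodal heads used only for the regularization loss.
        self.text_head = nn.Linear(hidden_dim, num_values)
        self.image_head = nn.Linear(hidden_dim, num_values)

    def forward(self, text_inputs, image_inputs):
        t = self.text_proj(self.text_encoder(text_inputs))
        v = self.image_proj(self.image_encoder(image_inputs))
        alpha = torch.sigmoid(self.gate(torch.cat([t, v], dim=-1)))
        fused = alpha * t + (1.0 - alpha) * v
        return self.head(fused), self.text_head(t), self.image_head(v)


def training_loss(logits, text_logits, image_logits, labels, aux_weight=0.3):
    # Main fused loss plus auxiliary unimodal losses: each encoder must stay
    # predictive on its own, which discourages ignoring one modality.
    main = F.cross_entropy(logits, labels)
    aux = F.cross_entropy(text_logits, labels) + F.cross_entropy(image_logits, labels)
    return main + aux_weight * aux


# Toy usage with stub encoders standing in for real pretrained backbones.
text_enc = nn.Sequential(nn.Flatten(), nn.Linear(128, 256))
img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
model = AttributeValueClassifier(text_enc, img_enc, 256, 512, 128, num_values=10)
logits, t_logits, v_logits = model(torch.randn(4, 128), torch.randn(4, 3, 32, 32))
loss = training_loss(logits, t_logits, v_logits, torch.randint(0, 10, (4,)))
```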