基于自蒸馏的电子商务产品理解多模态预训练

Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining Pub Date : 2023-02-27 DOI:10.1145/3539597.3570423

Shilei Liu, Lin Li, Jun Song, Yonghua Yang, Xiaoyi Zeng

{"title":"基于自蒸馏的电子商务产品理解多模态预训练","authors":"Shilei Liu, Lin Li, Jun Song, Yonghua Yang, Xiaoyi Zeng","doi":"10.1145/3539597.3570423","DOIUrl":null,"url":null,"abstract":"Product understanding refers to a series of product-centric tasks, such as classification, alignment and attribute values prediction, which requires fine-grained fusion of various modalities of products. Excellent product modeling ability will enhance the user experience and benefit search and recommendation systems. In this paper, we propose MBSD, a pre-trained vision-and-language model which can integrate the heterogeneous information of product in a single stream BERT-style architecture. Compared with current approaches, MBSD uses a lightweight convolutional neural network instead of a heavy feature extractor for image encoding, which has lower latency. Besides, we cleverly utilize user behavior data to design a two-stage pre-training task to understand products from different perspectives. In addition, there is an underlying imbalanced problem in multimodal pre-training, which will impairs downstream tasks. To this end, we propose a novel self-distillation strategy to transfer the knowledge in dominated modality to weaker modality, so that each modality can be fully tapped during pre-training. Experimental results on several product understanding tasks demonstrate that the performance of MBSD outperforms the competitive baselines.","PeriodicalId":227804,"journal":{"name":"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining","volume":"95 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multimodal Pre-Training with Self-Distillation for Product Understanding in E-Commerce\",\"authors\":\"Shilei Liu, Lin Li, Jun Song, Yonghua Yang, Xiaoyi Zeng\",\"doi\":\"10.1145/3539597.3570423\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Product understanding refers to a series of product-centric tasks, such as classification, alignment and attribute values prediction, which requires fine-grained fusion of various modalities of products. Excellent product modeling ability will enhance the user experience and benefit search and recommendation systems. In this paper, we propose MBSD, a pre-trained vision-and-language model which can integrate the heterogeneous information of product in a single stream BERT-style architecture. Compared with current approaches, MBSD uses a lightweight convolutional neural network instead of a heavy feature extractor for image encoding, which has lower latency. Besides, we cleverly utilize user behavior data to design a two-stage pre-training task to understand products from different perspectives. In addition, there is an underlying imbalanced problem in multimodal pre-training, which will impairs downstream tasks. To this end, we propose a novel self-distillation strategy to transfer the knowledge in dominated modality to weaker modality, so that each modality can be fully tapped during pre-training. Experimental results on several product understanding tasks demonstrate that the performance of MBSD outperforms the competitive baselines.\",\"PeriodicalId\":227804,\"journal\":{\"name\":\"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining\",\"volume\":\"95 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-02-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3539597.3570423\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3539597.3570423","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

产品理解是指一系列以产品为中心的任务，如分类、对齐、属性值预测等，需要对产品的各种形态进行细粒度融合。优秀的产品建模能力将提升用户体验，有利于搜索和推荐系统。本文提出了一种预训练的视觉和语言模型MBSD，它可以将产品的异构信息集成到一个单一的流bert式架构中。与现有方法相比，MBSD使用轻量级的卷积神经网络代替繁重的特征提取器进行图像编码，具有较低的延迟。此外，我们巧妙地利用用户行为数据，设计了一个两阶段的预训练任务，从不同的角度理解产品。此外，在多模态预训练中存在潜在的不平衡问题，这将损害下游任务。为此，我们提出了一种新的自蒸馏策略，将优势模态中的知识转移到较弱模态中，以便在预训练时充分利用每个模态。在多个产品理解任务上的实验结果表明，MBSD的性能优于竞争基线。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Multimodal Pre-Training with Self-Distillation for Product Understanding in E-Commerce

Product understanding refers to a series of product-centric tasks, such as classification, alignment and attribute values prediction, which requires fine-grained fusion of various modalities of products. Excellent product modeling ability will enhance the user experience and benefit search and recommendation systems. In this paper, we propose MBSD, a pre-trained vision-and-language model which can integrate the heterogeneous information of product in a single stream BERT-style architecture. Compared with current approaches, MBSD uses a lightweight convolutional neural network instead of a heavy feature extractor for image encoding, which has lower latency. Besides, we cleverly utilize user behavior data to design a two-stage pre-training task to understand products from different perspectives. In addition, there is an underlying imbalanced problem in multimodal pre-training, which will impairs downstream tasks. To this end, we propose a novel self-distillation strategy to transfer the knowledge in dominated modality to weaker modality, so that each modality can be fully tapped during pre-training. Experimental results on several product understanding tasks demonstrate that the performance of MBSD outperforms the competitive baselines.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining

自引率

0.00%

发文量