ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval

Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li
{"title":"ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval","authors":"Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li","doi":"arxiv-2408.02978","DOIUrl":null,"url":null,"abstract":"E-commerce is increasingly multimedia-enriched, with products exhibited in a\nbroad-domain manner as images, short videos, or live stream promotions. A\nunified and vectorized cross-domain production representation is essential. Due\nto large intra-product variance and high inter-product similarity in the\nbroad-domain scenario, a visual-only representation is inadequate. While\nAutomatic Speech Recognition (ASR) text derived from the short or live-stream\nvideos is readily accessible, how to de-noise the excessively noisy text for\nmultimodal representation learning is mostly untouched. We propose ASR-enhanced\nMultimodal Product Representation Learning (AMPere). In order to extract\nproduct-specific information from the raw ASR text, AMPere uses an\neasy-to-implement LLM-based ASR text summarizer. The LLM-summarized text,\ntogether with visual data, is then fed into a multi-branch network to generate\ncompact multimodal embeddings. Extensive experiments on a large-scale\ntri-domain dataset verify the effectiveness of AMPere in obtaining a unified\nmultimodal product representation that clearly improves cross-domain product\nretrieval.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"24 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.02978","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live-stream promotions. A unified and vectorized cross-domain product representation is therefore essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from short or live-stream videos is readily accessible, how to denoise this excessively noisy text for multimodal representation learning remains largely unexplored. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). To extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.
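
To make the described pipeline concrete, the following is a minimal sketch of the two stages outlined in the abstract: an LLM prompt that condenses noisy ASR text into product-specific information, and a multi-branch encoder that fuses the summarized text with visual features into a compact embedding for cross-domain retrieval. The prompt wording, layer dimensions, concatenation-based fusion, and the `retrieve` helper are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical prompt for the LLM-based ASR summarizer; the actual prompt
# and LLM used by AMPere are not specified in the abstract.
ASR_SUMMARY_PROMPT = (
    "The following is noisy ASR text transcribed from an e-commerce video. "
    "Summarize only the product-specific information (category, brand, key "
    "attributes) and ignore greetings, chit-chat, and promotional filler:\n"
    "{asr_text}"
)


class MultiBranchProductEncoder(nn.Module):
    """Sketch of a multi-branch network that fuses visual features with
    LLM-summarized ASR text into one compact product embedding.

    Dimensions and the simple concatenate-then-project fusion are
    illustrative assumptions, not the paper's exact architecture.
    """

    def __init__(self, visual_dim=768, text_dim=768, embed_dim=256):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)  # visual branch
        self.text_proj = nn.Linear(text_dim, embed_dim)      # ASR-text branch
        self.fusion = nn.Linear(2 * embed_dim, embed_dim)    # joint embedding

    def forward(self, visual_feat, text_feat):
        v = F.relu(self.visual_proj(visual_feat))
        t = F.relu(self.text_proj(text_feat))
        z = self.fusion(torch.cat([v, t], dim=-1))
        # Unit-normalize so retrieval reduces to cosine similarity.
        return F.normalize(z, dim=-1)


def retrieve(query_emb, gallery_embs, top_k=10):
    """Rank gallery products by cosine similarity to a query embedding."""
    sims = gallery_embs @ query_emb  # (N,) similarities for unit-norm vectors
    return sims.topk(min(top_k, sims.numel())).indices
```

L2-normalizing the joint embedding lets retrieval reduce to a cosine-similarity ranking over the gallery, which matches the compact, vectorized cross-domain representation the abstract calls for.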