Deconfounded fashion image captioning with transformer and multimodal retrieval

Tao Peng, Weiqiao Yin, Junping Liu, Li Li, Xinrong Hu
{"title":"Deconfounded fashion image captioning with transformer and multimodal retrieval","authors":"Tao Peng,&nbsp;Weiqiao Yin,&nbsp;Junping Liu,&nbsp;Li Li,&nbsp;Xinrong Hu","doi":"10.1016/j.vrih.2024.08.002","DOIUrl":null,"url":null,"abstract":"<div><h3>Background</h3><div>The annotation of fashion images is a significantly important task in the fashion industry as well as social media and e-commerce. However, owing to the complexity and diversity of fashion images, this task entails multiple challenges, including the lack of fine-grained captions and confounders caused by dataset bias. Specifically, confounders often cause models to learn spurious correlations, thereby reducing their generalization capabilities.</div></div><div><h3>Method</h3><div>In this work, we propose the Deconfounded Fashion Image Captioning (DFIC) framework, which first uses multimodal retrieval to enrich the predicted captions of clothing, and then constructs a detailed causal graph using causal inference in the decoder to perform deconfounding. Multimodal retrieval is used to obtain semantic words related to image features, which are input into the decoder as prompt words to enrich sentence descriptions. In the decoder, causal inference is applied to disentangle visual and semantic features while concurrently eliminating visual and language confounding.</div></div><div><h3>Results</h3><div>Overall, our method can not only effectively enrich the captions of target images, but also greatly reduce confounders caused by the dataset. To verify the effectiveness of the proposed framework, the model was experimentally verified using the FACAD dataset.</div></div>","PeriodicalId":33538,"journal":{"name":"Virtual Reality Intelligent Hardware","volume":"7 2","pages":"Pages 127-138"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Virtual Reality Intelligent Hardware","FirstCategoryId":"1093","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2096579624000494","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 0

Abstract

Background

The annotation of fashion images is an important task in the fashion industry as well as in social media and e-commerce. However, owing to the complexity and diversity of fashion images, this task entails multiple challenges, including the lack of fine-grained captions and confounders introduced by dataset bias. Specifically, confounders often cause models to learn spurious correlations, thereby reducing their generalization capability.

Method

In this work, we propose the Deconfounded Fashion Image Captioning (DFIC) framework, which first uses multimodal retrieval to enrich the predicted captions of clothing, and then constructs a detailed causal graph in the decoder, applying causal inference to perform deconfounding. Multimodal retrieval obtains semantic words related to the image features, which are fed into the decoder as prompt words to enrich the sentence descriptions. In the decoder, causal inference disentangles visual and semantic features while eliminating both visual and language confounding.
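The paper does not include code, but the retrieval-as-prompting idea can be illustrated with a minimal sketch: score a vocabulary of candidate fashion attribute words against the image embedding in a shared image-text space, and hand the top-k words to the decoder as a textual prompt. Everything here (the toy vocabulary, the random embeddings standing in for a trained multimodal encoder, the function name `retrieve_prompt_words`) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of retrieval-based prompting for captioning.
# Assumes image and word features already live in a shared embedding
# space (e.g., produced by a CLIP-style encoder, not shown here).
import torch
import torch.nn.functional as F

def retrieve_prompt_words(image_emb, word_embs, vocab, k=5):
    """image_emb: (d,) image feature; word_embs: (V, d) word features.
    Returns the k vocabulary words most similar to the image."""
    sims = F.cosine_similarity(word_embs, image_emb.unsqueeze(0), dim=-1)
    topk = sims.topk(k).indices
    return [vocab[i] for i in topk.tolist()]

# Toy usage: random embeddings stand in for a trained encoder.
vocab = ["floral", "denim", "sleeveless", "pleated", "cropped", "v-neck"]
d = 32
word_embs = torch.randn(len(vocab), d)
image_emb = torch.randn(d)
prompt = retrieve_prompt_words(image_emb, word_embs, vocab, k=3)
print("prompt words:", prompt)  # prepended to the decoder input as a prefix
```

In a full system the retrieved words would be tokenized and prepended to the decoder's input sequence, so the generated caption is conditioned on both the visual features and the retrieved attribute words.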

Results

Overall, our method not only effectively enriches the captions of target images, but also greatly reduces the confounding caused by dataset bias. To verify the effectiveness of the proposed framework, we evaluated the model on the FACAD dataset.
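To make the deconfounding step in the Method concrete: in the captioning literature, such causal interventions are commonly approximated with a backdoor adjustment, attending over a dictionary of confounder prototypes (e.g., per-category mean features) and mixing the result back into the input feature. The sketch below follows that common pattern under stated assumptions; the paper's exact formulation may differ, and the randomly initialized dictionary here merely stands in for dataset-level statistics.

```python
# Hedged sketch of backdoor-style deconfounding, a common approximation
# of P(Y | do(X)) = sum_z P(Y | X, z) P(z) in captioning models.
# The confounder dictionary z_dict is an assumption: in practice it would
# hold dataset-level statistics such as per-category mean visual features.
import torch
import torch.nn as nn

class BackdoorAdjustment(nn.Module):
    def __init__(self, dim, num_confounders):
        super().__init__()
        # Learnable stand-in for the confounder set; P(z) is implicitly
        # folded into the attention weights here (uniform prior assumed).
        self.z_dict = nn.Parameter(torch.randn(num_confounders, dim))
        self.query = nn.Linear(dim, dim)

    def forward(self, x):
        # Scaled attention over confounders approximates E_z[P(Y | X, z)].
        attn = torch.softmax(
            self.query(x) @ self.z_dict.t() / x.size(-1) ** 0.5, dim=-1)
        z = attn @ self.z_dict   # expected confounder feature given x
        return x + z             # intervened feature passed to the decoder

feat = torch.randn(2, 10, 512)   # (batch, tokens, dim) visual features
deconf = BackdoorAdjustment(512, num_confounders=100)
out = deconf(feat)
print(out.shape)                 # torch.Size([2, 10, 512])
```

Applying one such module to the visual stream and another to the language stream would mirror the paper's goal of eliminating visual and language confounding concurrently, though the specific architecture is the authors' own.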