Deconfounded fashion image captioning with transformer and multimodal retrieval

Tao Peng, Weiqiao Yin, Junping Liu, Li Li, Xinrong Hu

Virtual Reality Intelligent Hardware, 7(2), pp. 127-138, April 2025. DOI: 10.1016/j.vrih.2024.08.002
Abstract
Background
The annotation of fashion images is an important task in the fashion industry as well as in social media and e-commerce. However, owing to the complexity and diversity of fashion images, this task entails multiple challenges, including a lack of fine-grained captions and confounders introduced by dataset bias. Specifically, confounders often cause models to learn spurious correlations, thereby reducing their generalization capabilities.
Method
In this work, we propose the Deconfounded Fashion Image Captioning (DFIC) framework, which first uses multimodal retrieval to enrich the predicted captions of clothing images, and then constructs a detailed causal graph using causal inference in the decoder to perform deconfounding. Multimodal retrieval is used to obtain semantic words related to image features, which are fed into the decoder as prompt words to enrich sentence descriptions. In the decoder, causal inference is applied to disentangle visual and semantic features while eliminating visual and language confounding.
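The sketch below is a minimal illustration (not the authors' implementation) of the two ideas in the Method section: retrieving semantic prompt words for an image via cross-modal similarity, and a decoder-side backdoor-style adjustment over a dictionary of confounder prototypes. All function names, dimensions, the uniform prior, and the specific adjustment form are assumptions made for illustration only.

```python
# Hypothetical sketch of (1) multimodal retrieval of prompt words and
# (2) a backdoor-style deconfounding step; assumptions, not the DFIC code.
import torch
import torch.nn.functional as F


def retrieve_prompt_words(image_feat, word_embeds, vocab, k=5):
    """Return the k vocabulary words whose embeddings are most similar to
    the image feature (assumed cosine-similarity retrieval)."""
    sims = F.cosine_similarity(image_feat.unsqueeze(0), word_embeds, dim=-1)
    topk = sims.topk(k).indices
    return [vocab[i] for i in topk.tolist()], word_embeds[topk]


def backdoor_adjusted_context(query, confounder_dict, prior):
    """Approximate backdoor adjustment: attend over a fixed dictionary of
    confounder prototypes and average them under an assumed prior P(z)."""
    scale = confounder_dict.size(-1) ** 0.5
    attn = F.softmax(query @ confounder_dict.t() / scale, dim=-1)
    return (attn * prior) @ confounder_dict


# Toy example with random features.
d = 256
vocab = ["floral", "denim", "sleeve", "pleated", "collar", "zip"]
word_embeds = torch.randn(len(vocab), d)
image_feat = torch.randn(d)

words, prompt_embeds = retrieve_prompt_words(image_feat, word_embeds, vocab, k=3)

confounder_dict = torch.randn(10, d)   # e.g., clustered dataset prototypes (assumed)
prior = torch.full((10,), 1.0 / 10)    # assumed uniform P(z)
decoder_query = torch.randn(1, d)
context = backdoor_adjusted_context(decoder_query, confounder_dict, prior)

# In a full model, prompt_embeds would be prepended to the decoder input and
# `context` fused with the visual and semantic features at each decoding step.
print(words, context.shape)
```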
Results
Overall, our method not only effectively enriches the captions of target images but also greatly reduces confounders caused by the dataset. The effectiveness of the proposed framework was experimentally validated on the FACAD dataset.