MS-SMF: A probabilistic–causal multi-layer framework for image–text matching
Zhonghao Xi, Bengong Yu, Chenyue Li, Haoyu Wang, Shanlin Yang
Information Processing & Management, Volume 63, Issue 2, Article 104407 (published 2025-09-24). DOI: 10.1016/j.ipm.2025.104407
Citations: 0

Abstract
Image–text matching faces significant challenges due to the complex many-to-many semantic relationships between modalities, and existing methods remain deficient in both uncertainty modeling and causal robustness. To address these issues, we propose a multi-layer structural semantic matching framework (MS-SMF) that enhances cross-modal representations from a probabilistic–causal perspective. First, our method employs graph convolutional networks augmented with pseudo-coordinates and syntactic dependencies to precisely align image regions with text tokens, encoding each aligned pair as a Gaussian distribution to explicitly quantify semantic uncertainty. Next, these local Gaussian distributions are aggregated via structure-aware similarity weights into global Gaussian features, capturing holistic semantic context. Finally, we simulate causal interventions by randomly permuting the global mixture weights to generate counterfactual semantic distributions, and impose a contrastive loss between the original and counterfactual embeddings to enforce both invariance and discriminability. By synergizing the "soft" robustness of Gaussian modeling with the "hard" discriminative power of causal contrast, MS-SMF substantially improves the stability and discriminative capacity of cross-modal representations under structural perturbations. Experimental results on the MSCOCO and Flickr30K datasets demonstrate that our approach outperforms state-of-the-art methods (achieving an rSum of 542.9 on Flickr30K, with image-to-text R@1 and text-to-image R@1 improved by 1.0% and 2.3%, respectively), validating its superior performance in complex and ambiguous matching scenarios.
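The aggregation and counterfactual-intervention steps described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the diagonal-Gaussian parameterization, the moment-matched mixture aggregation, and the cosine-margin objective are all assumptions introduced here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for turning similarity scores into weights
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_pairs, d = 5, 8

# Each aligned region-token pair i is encoded as a diagonal Gaussian
# (mu_i, sigma_i^2); here the parameters are random stand-ins.
mu = rng.normal(size=(n_pairs, d))                   # local means
log_var = rng.normal(scale=0.1, size=(n_pairs, d))   # local log-variances
sim = rng.normal(size=n_pairs)                       # structure-aware similarity scores
w = softmax(sim)                                     # mixture weights, sum to 1

# Aggregate the local Gaussians into one global Gaussian by moment
# matching (law of total variance for a mixture of Gaussians).
var = np.exp(log_var)
global_mu = (w[:, None] * mu).sum(axis=0)
global_var = (w[:, None] * (var + mu**2)).sum(axis=0) - global_mu**2

# Counterfactual intervention: randomly permute the mixture weights
# and re-aggregate, yielding a perturbed "counterfactual" embedding.
w_cf = w[rng.permutation(n_pairs)]
cf_mu = (w_cf[:, None] * mu).sum(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Contrastive-style margin objective (illustrative): keep the original
# embedding close to itself while pushing the counterfactual away.
margin = 0.2
loss = max(0.0, margin - cosine(global_mu, global_mu) + cosine(global_mu, cf_mu))
```

In this sketch the permutation changes only which pair each weight attaches to, so the counterfactual preserves the local semantics while breaking the global structure; the margin loss then rewards representations that are invariant to noise but discriminative against such structural perturbations.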
Journal overview:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.