MS-SMF: A probabilistic–causal multi-layer framework for image–text matching
Zhonghao Xi, Bengong Yu, Chenyue Li, Haoyu Wang, Shanlin Yang
Information Processing & Management, Volume 63, Issue 2, Article 104407 (published 2025-09-24). DOI: 10.1016/j.ipm.2025.104407
Citations: 0

Abstract
Image–text matching faces significant challenges due to the complex many-to-many semantic relationships between modalities, and existing methods remain deficient in both uncertainty modeling and causal robustness. To address these issues, we propose a multi-layer structural semantic matching framework (MS-SMF) that enhances cross-modal representations from a probabilistic–causal perspective. First, our method employs graph convolutional networks augmented with pseudo-coordinates and syntactic dependencies to precisely align image regions with text tokens, encoding each aligned pair as a Gaussian distribution to explicitly quantify semantic uncertainty. Next, these local Gaussian distributions are aggregated via structure-aware similarity weights into global Gaussian features, capturing holistic semantic context. Finally, we simulate causal interventions by randomly permuting the global mixture weights to generate counterfactual semantic distributions, and impose a contrastive loss between the original and counterfactual embeddings to enforce both invariance and discriminability. By synergizing the "soft" robustness of Gaussian modeling with the "hard" discriminative power of causal contrast, MS-SMF substantially improves the stability and discriminative capacity of cross-modal representations under structural perturbations. Experimental results on the MSCOCO and Flickr30K datasets demonstrate that our approach outperforms state-of-the-art methods (achieving an rSum of 542.9 on Flickr30K, with image-to-text R@1 and text-to-image R@1 improved by 1.0% and 2.3%, respectively), validating its superior performance in complex and ambiguous matching scenarios.
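The aggregation and counterfactual-intervention steps described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the diagonal-Gaussian parameterization, the moment-matched mixture aggregation, and the cosine-margin objective are all assumptions introduced here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for turning similarity scores into weights
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_pairs, d = 5, 8

# Each aligned region-token pair i is encoded as a diagonal Gaussian
# (mu_i, sigma_i^2); here the parameters are random stand-ins.
mu = rng.normal(size=(n_pairs, d))                   # local means
log_var = rng.normal(scale=0.1, size=(n_pairs, d))   # local log-variances
sim = rng.normal(size=n_pairs)                       # structure-aware similarity scores
w = softmax(sim)                                     # mixture weights, sum to 1

# Aggregate the local Gaussians into one global Gaussian by moment
# matching (law of total variance for a mixture of Gaussians).
var = np.exp(log_var)
global_mu = (w[:, None] * mu).sum(axis=0)
global_var = (w[:, None] * (var + mu**2)).sum(axis=0) - global_mu**2

# Counterfactual intervention: randomly permute the mixture weights
# and re-aggregate, yielding a perturbed "counterfactual" embedding.
w_cf = w[rng.permutation(n_pairs)]
cf_mu = (w_cf[:, None] * mu).sum(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Contrastive-style margin objective (illustrative): keep the original
# embedding close to itself while pushing the counterfactual away.
margin = 0.2
loss = max(0.0, margin - cosine(global_mu, global_mu) + cosine(global_mu, cf_mu))
```

In this sketch the permutation changes only which pair each weight attaches to, so the counterfactual preserves the local semantics while breaking the global structure; the margin loss then rewards representations that are invariant to noise but discriminative against such structural perturbations.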
Journal overview:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.