GSSF：用于深度跨模态度量学习的广义结构稀疏函数。

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2024-10-31 DOI:10.1109/TIP.2024.3485498

Haiwen Diao;Ying Zhang;Shang Gao;Jiawen Zhu;Long Chen;Huchuan Lu

{"title":"GSSF：用于深度跨模态度量学习的广义结构稀疏函数。","authors":"Haiwen Diao;Ying Zhang;Shang Gao;Jiawen Zhu;Long Chen;Huchuan Lu","doi":"10.1109/TIP.2024.3485498","DOIUrl":null,"url":null,"abstract":"Cross-modal metric learning is a prominent research topic that bridges the semantic heterogeneity between vision and language. Existing methods frequently utilize simple cosine or complex distance metrics to transform the pairwise features into a similarity score, which suffers from an inadequate or inefficient capability for distance measurements. Consequently, we propose a Generalized Structural Sparse Function to dynamically capture thorough and powerful relationships across modalities for pair-wise similarity learning while remaining concise but efficient. Specifically, the distance metric delicately encapsulates two formats of diagonal and block-diagonal terms, automatically distinguishing and highlighting the cross-channel relevancy and dependency inside a structured and organized topology. Hence, it thereby empowers itself to adapt to the optimal matching patterns between the paired features and reaches a sweet spot between model complexity and capability. Extensive experiments on cross-modal and two extra uni-modal retrieval tasks (image-text retrieval, person re-identification, fine-grained image retrieval) have validated its superiority and flexibility over various popular retrieval frameworks. More importantly, we further discover that it can be seamlessly incorporated into multiple application scenarios, and demonstrates promising prospects from Attention Mechanism to Knowledge Distillation in a plug-and-play manner.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6241-6252"},"PeriodicalIF":0.0000,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"GSSF: Generalized Structural Sparse Function for Deep Cross-Modal Metric Learning\",\"authors\":\"Haiwen Diao;Ying Zhang;Shang Gao;Jiawen Zhu;Long Chen;Huchuan Lu\",\"doi\":\"10.1109/TIP.2024.3485498\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cross-modal metric learning is a prominent research topic that bridges the semantic heterogeneity between vision and language. Existing methods frequently utilize simple cosine or complex distance metrics to transform the pairwise features into a similarity score, which suffers from an inadequate or inefficient capability for distance measurements. Consequently, we propose a Generalized Structural Sparse Function to dynamically capture thorough and powerful relationships across modalities for pair-wise similarity learning while remaining concise but efficient. Specifically, the distance metric delicately encapsulates two formats of diagonal and block-diagonal terms, automatically distinguishing and highlighting the cross-channel relevancy and dependency inside a structured and organized topology. Hence, it thereby empowers itself to adapt to the optimal matching patterns between the paired features and reaches a sweet spot between model complexity and capability. Extensive experiments on cross-modal and two extra uni-modal retrieval tasks (image-text retrieval, person re-identification, fine-grained image retrieval) have validated its superiority and flexibility over various popular retrieval frameworks. More importantly, we further discover that it can be seamlessly incorporated into multiple application scenarios, and demonstrates promising prospects from Attention Mechanism to Knowledge Distillation in a plug-and-play manner.\",\"PeriodicalId\":94032,\"journal\":{\"name\":\"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society\",\"volume\":\"33 \",\"pages\":\"6241-6252\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-10-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10740591/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10740591/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

跨模态度量学习是连接视觉和语言之间语义异质性的一个突出研究课题。现有的方法通常利用简单的余弦或复杂的距离度量来将成对特征转化为相似度得分，但这种方法存在距离测量能力不足或效率低下的问题。因此，我们提出了一种广义结构稀疏函数（Generalized Structural Sparse Function），以动态捕捉不同模态之间全面而强大的关系，从而实现成对相似性学习，同时保持简洁高效。具体来说，该距离度量微妙地囊括了对角线和块对角线两种形式的术语，在结构化和有组织的拓扑结构中自动区分并突出跨渠道的相关性和依赖性。因此，它能够适应配对特征之间的最佳匹配模式，并在模型复杂性和能力之间找到最佳平衡点。在跨模态和两个额外的单模态检索任务（图像-文本检索、人物再识别、细粒度图像检索）上的广泛实验验证了它优于各种流行检索框架的灵活性。更重要的是，我们进一步发现，它可以无缝集成到多种应用场景中，并以即插即用的方式展示了从注意力机制到知识蒸馏的广阔前景。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

GSSF: Generalized Structural Sparse Function for Deep Cross-Modal Metric Learning

Cross-modal metric learning is a prominent research topic that bridges the semantic heterogeneity between vision and language. Existing methods frequently utilize simple cosine or complex distance metrics to transform the pairwise features into a similarity score, which suffers from an inadequate or inefficient capability for distance measurements. Consequently, we propose a Generalized Structural Sparse Function to dynamically capture thorough and powerful relationships across modalities for pair-wise similarity learning while remaining concise but efficient. Specifically, the distance metric delicately encapsulates two formats of diagonal and block-diagonal terms, automatically distinguishing and highlighting the cross-channel relevancy and dependency inside a structured and organized topology. Hence, it thereby empowers itself to adapt to the optimal matching patterns between the paired features and reaches a sweet spot between model complexity and capability. Extensive experiments on cross-modal and two extra uni-modal retrieval tasks (image-text retrieval, person re-identification, fine-grained image retrieval) have validated its superiority and flexibility over various popular retrieval frameworks. More importantly, we further discover that it can be seamlessly incorporated into multiple application scenarios, and demonstrates promising prospects from Attention Mechanism to Knowledge Distillation in a plug-and-play manner.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

自引率

0.00%

发文量