Bi-Direction Label-Guided Semantic Enhancement for Cross-Modal Hashing

Authors: Lei Zhu; Runbing Wu; Xinghui Zhu; Chengyuan Zhang; Lin Wu; Shichao Zhang; Xuelong Li
DOI: 10.1109/TCSVT.2024.3521646
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 5, pp. 3983-3999
Published: 2024-12-24
URL: https://ieeexplore.ieee.org/document/10813461/
Citations: 0
Abstract
Supervised cross-modal hashing has gained significant attention due to its efficiency in reducing storage and computation costs while maintaining rich semantic information. Despite substantial progress in generating compact binary codes, two key challenges remain: (1) insufficient utilization of labels to mine and fuse multi-grained semantic information, and (2) unreliable cross-modal interaction, which does not fully leverage multi-grained semantics or accurately capture sample relationships. To address these limitations, we propose a novel method called Bi-direction Label-Guided Semantic Enhancement for cross-modal Hashing (BiLGSEH). To tackle the first challenge, we introduce a label-guided semantic fusion strategy that extracts and integrates multi-grained semantic features guided by multi-labels. For the second challenge, we propose a semantic-enhanced relation aggregation strategy that constructs and aggregates multi-modal relational information through bi-directional similarity. Additionally, we incorporate CLIP features to improve the alignment between multi-modal content and complex semantics. In summary, BiLGSEH generates discriminative hash codes by effectively aligning semantic distribution and relational structure across modalities. Extensive performance evaluations against 18 competitive methods demonstrate the superiority of our approach. The source code for our method is publicly available at: https://github.com/yileicc/BiLGSEH.
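To make the ingredients named in the abstract more concrete, the sketch below shows a generic supervised cross-modal hashing setup in PyTorch: modality-specific hash heads over pre-extracted features (e.g., CLIP embeddings), a pairwise similarity target derived from multi-labels, and a bi-directional similarity loss. This is a minimal illustrative example and not the authors' BiLGSEH implementation (their code is available at the repository linked above); all module names, the loss form, and the hyperparameters here are assumptions made purely for illustration.

```python
# Minimal sketch of a supervised cross-modal hashing pipeline (PyTorch).
# NOT the BiLGSEH implementation; structure and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashHead(nn.Module):
    """Maps one modality's features to K relaxed hash bits via tanh."""
    def __init__(self, in_dim: int, code_len: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, code_len),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.proj(x))  # relaxed codes in (-1, 1)

def label_similarity(labels: torch.Tensor) -> torch.Tensor:
    """Pairwise semantic similarity from multi-hot labels: 1 if two samples share any label."""
    return (labels @ labels.t() > 0).float()

def bidirectional_similarity(img_codes, txt_codes):
    """Cosine similarity computed in both directions (image->text and text->image)."""
    i = F.normalize(img_codes, dim=1)
    t = F.normalize(txt_codes, dim=1)
    return i @ t.t(), t @ i.t()

def hashing_loss(img_codes, txt_codes, labels):
    """Pull codes of label-similar pairs together in both directions, plus a quantization term."""
    s = label_similarity(labels)                       # target similarity matrix
    s_it, s_ti = bidirectional_similarity(img_codes, txt_codes)
    sim_loss = F.mse_loss(s_it, s) + F.mse_loss(s_ti, s)
    quant_loss = (img_codes.abs() - 1).pow(2).mean() + \
                 (txt_codes.abs() - 1).pow(2).mean()   # push relaxed codes toward +/-1
    return sim_loss + 0.1 * quant_loss

# Usage with random stand-ins for 512-d image/text features and 24 candidate labels:
if __name__ == "__main__":
    B, D, K, C = 8, 512, 64, 24
    img_feat, txt_feat = torch.randn(B, D), torch.randn(B, D)
    labels = (torch.rand(B, C) > 0.8).float()
    img_head, txt_head = HashHead(D, K), HashHead(D, K)
    loss = hashing_loss(img_head(img_feat), txt_head(txt_feat), labels)
    loss.backward()
    binary_codes = torch.sign(img_head(img_feat))      # binarize at retrieval time
    print(loss.item(), binary_codes.shape)
```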
About the Journal
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.