Enhancing Multi-Label Deep Hashing for Image and Audio With Joint Internal Global Loss Constraints and Large Vision-Language Model
Ye Liu; Yan Pan; Jian Yin
IEEE Signal Processing Letters, published 2024-09-06. DOI: 10.1109/LSP.2024.3455991
https://ieeexplore.ieee.org/document/10669173/
Deep hashing algorithms transform high-dimensional features into low-dimensional hash codes, reducing storage space and improving computational efficiency in both traditional information retrieval (IR) and large-model retrieval-augmented generation (RAG) scenarios. In recent years, pre-trained convolutional or transformer networks have commonly been chosen as the backbone of deep hashing frameworks: local loss constraints are imposed among training samples, and the model is then fine-tuned to generate hash codes. Because such constraints among training samples capture only limited local information, we propose two novel internal global loss constraints, an anchor constraint and a structural constraint, built on a vision transformer backbone, and we augment external information by integrating a large vision-language model, thereby improving the quality of the generated hash codes. Additionally, to improve the scalability of the proposed deep hashing framework, we incorporate an adapter module that extends its application from the image domain to the audio domain. Comparative experiments and ablation analysis on various image and audio datasets confirm that the proposed method achieves state-of-the-art retrieval results.
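To make the core idea concrete, the following is a minimal, generic sketch of hash-based retrieval, not the paper's method: a linear projection stands in for the trained hashing head (the paper uses a fine-tuned vision transformer with global loss constraints), its sign gives binary codes, and the database is ranked by Hamming distance to the query code. All names and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 512, 64, 100  # feature dim, hash bits, database size (illustrative)

# A fixed random projection stands in for a trained hashing head.
W = rng.normal(size=(D, K))

def hash_codes(feats: np.ndarray) -> np.ndarray:
    """Map real-valued features to {0,1} bit vectors via sign of a projection."""
    return (feats @ W > 0).astype(np.uint8)

# Pretend database and query features (in practice, backbone embeddings).
db_feats = rng.normal(size=(N, D))
db_codes = hash_codes(db_feats)          # (N, K) compact binary codes
q_code = hash_codes(rng.normal(size=(1, D)))

# Hamming distance = number of differing bits; rank database by it.
dists = np.count_nonzero(db_codes != q_code, axis=1)
ranking = np.argsort(dists)              # indices of nearest items first
```

Storing K bits per item instead of D floats is what yields the storage and speed gains the abstract refers to; the learning problem is choosing the hashing head so that semantically similar items receive nearby codes.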
Journal Introduction:
The IEEE Signal Processing Letters is a monthly archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language, and audio processing. Papers published in the Letters can be presented within one year of their appearance at signal processing conferences such as ICASSP, GlobalSIP, and ICIP, as well as at several workshops organized by the Signal Processing Society.