TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval

Proceedings of the 2022 International Conference on Multimedia Retrieval Pub Date : 2021-05-05 DOI:10.1145/3512527.3531405

Yongbiao Chen, Shenmin Zhang, Fangxin Liu, Zhigang Chang, Mang Ye, Zhengwei Qi
Shanghai Jiao Tong University, U. California, W. University


Citations: 22

Abstract

Deep hashing has gained growing popularity in approximate nearest neighbor search for large-scale image retrieval. Until now, deep hashing for image retrieval has been dominated by convolutional neural network architectures, e.g., ResNet [22]. In this paper, inspired by recent advances in vision transformers, we present TransHash, a pure transformer-based framework for deep hash learning. Concretely, our framework is composed of two major modules: (1) Based on the Vision Transformer (ViT), we design a siamese Multi-Granular Vision Transformer backbone (MGVT) for image feature extraction. To learn fine-grained features, we introduce a dual-stream multi-granular feature learning scheme on top of the transformer that learns discriminative global and local features. (2) We adopt a Bayesian learning scheme with a dynamically constructed similarity matrix to learn compact binary hash codes. The entire framework is jointly trained in an end-to-end manner. To the best of our knowledge, this is the first work to tackle deep hash learning without convolutional neural networks (CNNs). We perform comprehensive experiments on three widely studied datasets: CIFAR-10, NUS-WIDE, and ImageNet. The experiments demonstrate the superiority of our method over existing state-of-the-art deep hashing methods. Specifically, we achieve gains of 8.2%, 2.6%, and 12.7% in average mAP across different hash bit lengths on the three datasets, respectively.
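To make the retrieval step concrete, the following is a minimal sketch of how compact binary hash codes can be used for Hamming-ranking retrieval. It is not the authors' implementation: the function names (`binarize`, `hamming_distances`, `rank_by_hamming`), the sign-based binarization, and the toy 4-bit features are all assumptions made for illustration. The sketch relies on the standard identity that, for K-bit codes in {-1, +1}, the Hamming distance equals (K - inner product) / 2.

```python
import numpy as np


def binarize(features: np.ndarray) -> np.ndarray:
    """Map real-valued features to {-1, +1} codes via the sign function.

    Sign binarization is a common choice in deep hashing; it is assumed
    here and not taken from the paper.
    """
    return np.where(features >= 0, 1, -1).astype(np.int8)


def hamming_distances(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Hamming distance between one K-bit code and every database row.

    For codes in {-1, +1}: d_H(q, b) = (K - <q, b>) / 2.
    """
    k = query.shape[0]
    inner = database.astype(np.int32) @ query.astype(np.int32)
    return (k - inner) // 2


def rank_by_hamming(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Return database indices sorted from nearest to farthest."""
    return np.argsort(hamming_distances(query, database), kind="stable")


# Toy example: three database items and one query, each with 4-bit codes.
database = binarize(np.array([
    [0.9, -0.2, 0.4, -0.7],
    [-0.5, -0.1, 0.3, 0.8],
    [0.6, -0.3, -0.2, -0.9],
]))
query = binarize(np.array([0.7, -0.4, 0.1, -0.6]))

print(hamming_distances(query, database))  # [0 2 1]
print(rank_by_hamming(query, database))    # [0 2 1]
```

Because the ranking needs only integer inner products (or, with bit-packed codes, XOR and popcount), retrieval over binary codes is far cheaper than comparing real-valued feature vectors, which is the efficiency argument behind hashing-based retrieval.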

