{"title":"基于语义保持和视觉转换的深度跨模态哈希检索","authors":"Jin Hong, Huayong Liu","doi":"10.1145/3573428.3573439","DOIUrl":null,"url":null,"abstract":"In response to the problem of similarity measure differences in different similarity coefficients that occur in cross-modal multi-label retrieval, this article uses an interval parameter to correct this bias. A new supervised hash method is proposed by introducing the transformer structure which performs well in CV and NLP tasks into cross-modal hash retrieval, called the Deep Semantics Preserving Vision Transformer Hashing (DSPVTH). This method uses network structures such as vision transformer to map different modal data into binary hash codes. It also uses the similarity relationship of multiple tags to maintain the semantic association between different modal data. Validation on four commonly used multimodal text datasets, Mirflickr25k, NUS-WIDE, COCO2014 and IAPR TC-12, shows a 2% to 8% improvement in average accuracy compared with the current optimal method, which means our method is robust and effective.","PeriodicalId":314698,"journal":{"name":"Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering","volume":"111 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Deep Cross-modal Hashing Retrieval Based on Semantics Preserving and Vision Transformer\",\"authors\":\"Jin Hong, Huayong Liu\",\"doi\":\"10.1145/3573428.3573439\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In response to the problem of similarity measure differences in different similarity coefficients that occur in cross-modal multi-label retrieval, this article uses an interval parameter to correct this bias. A new supervised hash method is proposed by introducing the transformer structure which performs well in CV and NLP tasks into cross-modal hash retrieval, called the Deep Semantics Preserving Vision Transformer Hashing (DSPVTH). This method uses network structures such as vision transformer to map different modal data into binary hash codes. It also uses the similarity relationship of multiple tags to maintain the semantic association between different modal data. 
Validation on four commonly used multimodal text datasets, Mirflickr25k, NUS-WIDE, COCO2014 and IAPR TC-12, shows a 2% to 8% improvement in average accuracy compared with the current optimal method, which means our method is robust and effective.\",\"PeriodicalId\":314698,\"journal\":{\"name\":\"Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering\",\"volume\":\"111 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3573428.3573439\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3573428.3573439","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract: To address the bias that arises in cross-modal multi-label retrieval when different similarity coefficients yield different similarity measures, this article introduces an interval parameter to correct it. A new supervised hashing method, called Deep Semantics Preserving Vision Transformer Hashing (DSPVTH), is proposed by bringing the transformer architecture, which performs well in CV and NLP tasks, into cross-modal hash retrieval. The method uses network structures such as the vision transformer to map data from different modalities into binary hash codes, and it exploits the similarity relationships among multiple labels to preserve the semantic association between modalities. Validation on four commonly used multimodal image-text datasets, Mirflickr25k, NUS-WIDE, COCO2014 and IAPR TC-12, shows a 2% to 8% improvement in average accuracy over the current best methods, indicating that the approach is robust and effective.
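The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of the two stated ideas: a multi-label similarity softened by an interval parameter, and hashing heads that relax binary codes with tanh so that code inner products can be trained to match label similarity. The interval value `mu`, the stand-in linear encoders (used here in place of the vision transformer and text network), the bit width, and the loss form are all assumptions for illustration, not the authors' architecture.

```python
# Minimal sketch (not the authors' code) of semantics-preserving cross-modal hashing.
import torch
import torch.nn as nn
import torch.nn.functional as F

def multilabel_similarity(labels_a, labels_b, mu=0.3):
    """Soft similarity in [0, 1] from multi-hot label vectors.

    Plain cosine under-weights partially overlapping label sets, so a
    hypothetical interval parameter `mu` lifts any nonzero overlap into
    [mu, 1], while disjoint label sets stay at exactly 0.
    """
    cos = F.cosine_similarity(labels_a.unsqueeze(1).float(),
                              labels_b.unsqueeze(0).float(), dim=-1)
    overlap = (labels_a.float() @ labels_b.float().t()) > 0
    return torch.where(overlap, mu + (1 - mu) * cos, torch.zeros_like(cos))

class HashHead(nn.Module):
    """Maps modality features to K-bit codes; tanh relaxes sign() during training."""
    def __init__(self, in_dim, n_bits=64):
        super().__init__()
        self.fc = nn.Linear(in_dim, n_bits)
    def forward(self, feats):
        return torch.tanh(self.fc(feats))

# Stand-in encoders: the paper uses a vision transformer for images and a
# text network for tags; plain linear layers keep this sketch self-contained.
img_encoder, txt_encoder = nn.Linear(2048, 512), nn.Linear(1386, 512)
img_head, txt_head = HashHead(512), HashHead(512)

img_codes = img_head(img_encoder(torch.randn(8, 2048)))   # fake image batch
txt_codes = txt_head(txt_encoder(torch.randn(8, 1386)))   # fake tag batch
labels = torch.randint(0, 2, (8, 24))                      # multi-hot labels

# Semantics-preserving objective: scaled inner products of the relaxed codes
# should match the label-derived similarity rescaled to [-1, 1].
S = 2 * multilabel_similarity(labels, labels) - 1
inner = img_codes @ txt_codes.t() / img_codes.shape[1]
loss = F.mse_loss(inner, S)
print(loss.item())
```

At retrieval time one would binarize the relaxed codes with sign() and rank by Hamming distance; that step is omitted here for brevity.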