CRA: Text to Image Retrieval for Architecture Images by Chinese CLIP
Siyuan Wang, Yuyao Yan, Xi Yang, Kaizhu Huang
2023 7th International Conference on Machine Vision and Information Technology (CMVIT), March 2023
DOI: 10.1109/cmvit57620.2023.00015
Abstract
Text-to-image retrieval has been revolutionized since the Contrastive Language-Image Pre-training (CLIP) model was proposed. Most existing methods learn a latent representation of the text and then align its embedding with the corresponding image's embedding from an image encoder. Recently, several Chinese CLIP models have provided good representations of paired image-text sets. However, adapting a pre-trained retrieval model to a professional domain remains a challenge, mainly due to the large domain gap between professional and general text-image sets. In this paper, we introduce a novel contrastive tuning model, named CRA, which uses Chinese texts to retrieve architecture-related images by fine-tuning a pre-trained Chinese CLIP. Instead of fine-tuning the whole CLIP model, we adopt the Locked-image Text tuning (LiT) strategy to adapt to the architecture-terminology sets, tuning the text encoder while freezing the pre-trained large-scale image encoder. We further propose a text-image dataset of architectural design. On the text-to-image retrieval task, our CRA model improves R@20 on the test set from 44.92% (with the original Chinese CLIP model) to 74.61%.