CRA: Text to Image Retrieval for Architecture Images by Chinese CLIP
Siyuan Wang, Yuyao Yan, Xi Yang, Kaizhu Huang
2023 7th International Conference on Machine Vision and Information Technology (CMVIT), March 2023
DOI: 10.1109/cmvit57620.2023.00015
Abstract
Text-to-image retrieval has been revolutionized since the Contrastive Language-Image Pre-training (CLIP) model was proposed. Most existing methods learn a latent representation of the text and then align its embedding with the corresponding image's embedding from an image encoder. Recently, several Chinese CLIP models have provided good representations of paired image-text sets. However, adapting a pre-trained retrieval model to a professional domain remains a challenge, mainly due to the large domain gap between professional and general text-image sets. In this paper, we introduce a novel contrastive tuning model, named CRA, which uses Chinese texts to retrieve architecture-related images by fine-tuning a pre-trained Chinese CLIP. Instead of fine-tuning the whole CLIP model, we adopt the Locked-image Text tuning (LiT) strategy to adapt to the architecture-terminology sets, tuning the text encoder while freezing the pre-trained large-scale image encoder. We further propose a text-image dataset of architectural design. On the text-to-image retrieval task, our CRA model improves R@20 on the test set from 44.92% (with the original Chinese CLIP model) to 74.61%.