PRO-CLIP: A CLIP-Based Category Measurement Network Through Prototype and Regularized Optimal Transportation

IF 5.6, Zone 2 (Engineering Technology), JCR Q1 (ENGINEERING, ELECTRICAL & ELECTRONIC)

IEEE Transactions on Instrumentation and Measurement, Pub Date: 2024-10-23, DOI: 10.1109/TIM.2024.3485403

He Cao; Yunzhou Zhang; Shangdong Zhu; Lei Wang

Volume 73, pages 1-18. Journal article. IEEE Xplore: https://ieeexplore.ieee.org/document/10733835/

Citations: 0

Abstract



In unstructured environments, robots are likely to encounter desktop objects that they have never seen before. Classifying these objects precisely is a prerequisite for accomplishing object-specific manipulation tasks. However, collecting large-scale object classification datasets is time-consuming. Inspired by prompt tuning methods, we propose the PRO-CLIP network, a category measurement method for desktop objects. Specifically, PRO-CLIP performs few-shot classification based on the knowledge of a pretrained vision-language model (VLM). It uses token-level and prompt-level optimal transportation (OT) to jointly fine-tune the learnable vision-language prompts. For the token-level stage, we propose the image patch reweighting (PR) mechanism to make alignments focus on the image patches that are close to the patch prototypes. This allows the patch embeddings to have converging category representations, which reduces intraclass differences in visual features. For the prompt-level stage, we propose a cascading OT (COT) module to simultaneously consider task-irrelevant knowledge in zero-shot features and task-specific knowledge in few-shot features. Owing to the generalization ability of task-irrelevant knowledge, the proposed module achieves feature regularization during OT. Finally, we propose the UP loss to supervise the whole network. It contains unbalanced logit-level consistency losses and a visual prototype loss. The logit-level consistency losses pull the learnable features toward the zero-shot features. The prototype loss pulls the visual features toward their corresponding prototypes. We demonstrate the effectiveness of our method through few-shot classification experiments on different datasets, including desktop objects. The relevant code will be available at https://github.com/NeuCV-IRMI/proclip.
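The token-level idea in the abstract (optimal transport between image patches and prompt features, with more weight on patches near a prototype) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the entropic (Sinkhorn) solver, the softmax-over-cosine-similarity weighting, the cosine cost, and every name and parameter here (`reweighted_ot_distance`, `tau`, `eps`, `n_iters`) are assumptions chosen only for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale vectors along the last axis to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def softmax(x, tau=1.0):
    # Numerically stable softmax with temperature tau.
    z = (x - x.max()) / tau
    e = np.exp(z)
    return e / e.sum()

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    # Entropic-regularized OT plan with row marginal a and column marginal b.
    # The last update is on u, so row sums equal a up to rounding; column
    # sums approach b as the iteration converges.
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def reweighted_ot_distance(patches, prompts, prototype, tau=0.1, eps=0.1):
    # Hypothetical reweighting: patches whose embedding is closer (in cosine
    # similarity) to the prototype receive more transport mass.
    patches = l2_normalize(patches)
    prompts = l2_normalize(prompts)
    prototype = l2_normalize(prototype)
    a = softmax(patches @ prototype, tau)                  # patch weights, shape (M,)
    b = np.full(prompts.shape[0], 1.0 / prompts.shape[0])  # uniform prompt weights
    cost = 1.0 - patches @ prompts.T                       # cosine cost, shape (M, N)
    plan = sinkhorn(cost, a, b, eps)
    return float((plan * cost).sum()), plan, a

# Toy usage with random stand-ins for patch and prompt embeddings.
rng = np.random.default_rng(0)
patch_emb = rng.normal(size=(6, 8))    # 6 patches, 8-dim features
prompt_emb = rng.normal(size=(3, 8))   # 3 prompt features
distance, plan, weights = reweighted_ot_distance(
    patch_emb, prompt_emb, patch_emb.mean(axis=0)
)
print(distance, weights.round(3))
```

In this sketch the prototype is simply the mean patch embedding; the paper's prototypes, cascading OT module, and UP loss are not reproduced here.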


Source journal

IEEE Transactions on Instrumentation and Measurement, Engineering Technology - Engineering: Electrical & Electronic

CiteScore

9.00

Self-citation rate

23.20%

Articles published

1294

Review time

3.9 months

Journal description: Papers are sought that address innovative solutions to the development and use of electrical and electronic instruments and equipment to measure, monitor and/or record physical phenomena for the purpose of advancing measurement science, methods, functionality and applications. The scope of these papers may encompass: (1) theory, methodology, and practice of measurement; (2) design, development and evaluation of instrumentation and measurement systems and components used in generating, acquiring, conditioning and processing signals; (3) analysis, representation, display, and preservation of the information obtained from a set of measurements; and (4) scientific and technical support to establishment and maintenance of technical standards in the field of Instrumentation and Measurement.