Learning General and Specific Embedding with Transformer for Few-Shot Object Detection

IF 11.6 2区 计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Xu Zhang, Zhe Chen, Jing Zhang, Tongliang Liu, Dacheng Tao
{"title":"Learning General and Specific Embedding with Transformer for Few-Shot Object Detection","authors":"Xu Zhang, Zhe Chen, Jing Zhang, Tongliang Liu, Dacheng Tao","doi":"10.1007/s11263-024-02199-0","DOIUrl":null,"url":null,"abstract":"<p>Few-shot object detection (FSOD) studies how to detect novel objects with few annotated examples effectively. Recently, it has been demonstrated that decent feature embeddings, including the general feature embeddings that are more invariant to visual changes and the specific feature embeddings that are more discriminative for different object classes, are both important for FSOD. However, current methods lack appropriate mechanisms to sensibly cooperate both types of feature embeddings based on their importance to detecting objects of novel classes, which may result in sub-optimal performance. In this paper, to achieve more effective FSOD, we attempt to explicitly encode both general and specific feature embeddings using learnable tensors and apply a Transformer to help better incorporate them in FSOD according to their relations to the input object features. We thus propose a Transformer-based general and specific embedding learning (T-GSEL) method for FSOD. In T-GSEL, learnable tensors are employed in a three-stage pipeline, encoding feature embeddings in general level, intermediate level, and specific level, respectively. In each stage, we apply a Transformer to first model the relations of the corresponding embedding to input object features and then apply the estimated relations to refine the input features. Meanwhile, we further introduce cross-stage connections between embeddings of different stages to make them complement and cooperate with each other, delivering general, intermediate, and specific feature embeddings stage by stage and utilizing them together for feature refinement in FSOD. In practice, a T-GSEL module is easy to inject. Extensive empirical results further show that our proposed T-GSEL method achieves compelling FSOD performance on both PASCAL VOC and MS COCO datasets compared with other state-of-the-art approaches.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"5 1","pages":""},"PeriodicalIF":11.6000,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11263-024-02199-0","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Few-shot object detection (FSOD) studies how to detect novel objects with few annotated examples effectively. Recently, it has been demonstrated that decent feature embeddings, including the general feature embeddings that are more invariant to visual changes and the specific feature embeddings that are more discriminative for different object classes, are both important for FSOD. However, current methods lack appropriate mechanisms to sensibly cooperate both types of feature embeddings based on their importance to detecting objects of novel classes, which may result in sub-optimal performance. In this paper, to achieve more effective FSOD, we attempt to explicitly encode both general and specific feature embeddings using learnable tensors and apply a Transformer to help better incorporate them in FSOD according to their relations to the input object features. We thus propose a Transformer-based general and specific embedding learning (T-GSEL) method for FSOD. In T-GSEL, learnable tensors are employed in a three-stage pipeline, encoding feature embeddings in general level, intermediate level, and specific level, respectively. In each stage, we apply a Transformer to first model the relations of the corresponding embedding to input object features and then apply the estimated relations to refine the input features. Meanwhile, we further introduce cross-stage connections between embeddings of different stages to make them complement and cooperate with each other, delivering general, intermediate, and specific feature embeddings stage by stage and utilizing them together for feature refinement in FSOD. In practice, a T-GSEL module is easy to inject. Extensive empirical results further show that our proposed T-GSEL method achieves compelling FSOD performance on both PASCAL VOC and MS COCO datasets compared with other state-of-the-art approaches.

Abstract Image

利用变换器学习通用和特定嵌入,实现少镜头物体检测
少量物体检测(FSOD)研究如何利用少量注释示例有效地检测新物体。最近的研究表明,适当的特征嵌入(包括对视觉变化更不变的一般特征嵌入和对不同物体类别更有区分度的特定特征嵌入)对 FSOD 都很重要。然而,目前的方法缺乏适当的机制来根据这两类特征嵌入对检测新类别物体的重要性进行合理的合作,这可能会导致性能不达标。在本文中,为了实现更有效的 FSOD,我们尝试使用可学习张量对一般特征嵌入和特定特征嵌入进行明确编码,并应用变换器根据它们与输入对象特征的关系将它们更好地纳入 FSOD。因此,我们为 FSOD 提出了一种基于变换器的一般和特殊嵌入学习(T-GSEL)方法。在 T-GSEL 中,可学习的张量被用于一个三阶段的管道中,分别对一般级、中间级和特殊级的特征嵌入进行编码。在每个阶段,我们首先使用变换器来模拟相应嵌入与输入对象特征之间的关系,然后应用估计的关系来完善输入特征。同时,我们还进一步在不同阶段的嵌入之间引入跨阶段连接,使它们相互补充、相互配合,逐级提供一般、中间和特定特征嵌入,并将它们共同用于 FSOD 中的特征提纯。在实践中,T-GSEL 模块很容易注入。广泛的实证结果进一步表明,与其他最先进的方法相比,我们提出的 T-GSEL 方法在 PASCAL VOC 和 MS COCO 数据集上都取得了令人信服的 FSOD 性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
International Journal of Computer Vision
International Journal of Computer Vision 工程技术-计算机:人工智能
CiteScore
29.80
自引率
2.10%
发文量
163
审稿时长
6 months
期刊介绍: The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal encompasses various types of articles to cater to different research outputs. Regular articles, which span up to 25 journal pages, focus on significant technical advancements that are of broad interest to the field. These articles showcase substantial progress in computer vision. Short articles, limited to 10 pages, offer a swift publication path for novel research outcomes. They provide a quicker means for sharing new findings with the computer vision community. Survey articles, comprising up to 30 pages, offer critical evaluations of the current state of the art in computer vision or offer tutorial presentations of relevant topics. These articles provide comprehensive and insightful overviews of specific subject areas. In addition to technical articles, the journal also includes book reviews, position papers, and editorials by prominent scientific figures. These contributions serve to complement the technical content and provide valuable perspectives. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software. This additional material enhances the understanding and reproducibility of the published research. Overall, the International Journal of Computer Vision is a comprehensive publication that caters to researchers in this rapidly growing field. It covers a range of article types, offers additional online resources, and facilitates the dissemination of impactful research.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信