HA-FGOVD: Highlighting Fine-Grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection

IF 8.4 1区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI:10.1109/TMM.2025.3557624

Yuqi Ma;Mengyin Liu;Chao Zhu;Xu-Cheng Yin

{"title":"HA-FGOVD: Highlighting Fine-Grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection","authors":"Yuqi Ma;Mengyin Liu;Chao Zhu;Xu-Cheng Yin","doi":"10.1109/TMM.2025.3557624","DOIUrl":null,"url":null,"abstract":"Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal Models (LMM), due to their extensive training data and a large number of parameters. Mainstream OVD models prioritize object coarse-grained category rather than focus on their fine-grained attributes, e.g., colors or materials, thus failed to identify objects specified with certain attributes. Despite being pretrained on large-scale image-text pairs with rich attribute information, their latent feature space does not highlight these fine-grained attributes. In this paper, we introduce HA-FGOVD, a universal and explicit method that enhances the attribute-level detection capabilities of frozen OVD models by highlighting fine-grained attributes in explicit linear space. Our approach uses a LLM to extract attribute words in input text as a zero-shot task. Then, token attention masks are adjusted to guide text encoders in extracting both global and attribute-specific features, which are explicitly composited as two vectors in linear space to form a new attribute-highlighted feature for detection tasks. The composition weight scalars can be learned or transferred across different OVD models, showcasing the universality of our method. Experimental results show that HA-FGOVD achieves state-of-the-art performance on the FG-OVD benchmark and demonstrates promising generalization on the OVDEval benchmark, suggesting that our method addresses significant limitations in fine-grained attribute detection and has potential for broader fine-grained detection applications.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3171-3183"},"PeriodicalIF":8.4000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10948367/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal Models (LMM), due to their extensive training data and a large number of parameters. Mainstream OVD models prioritize object coarse-grained category rather than focus on their fine-grained attributes, e.g., colors or materials, thus failed to identify objects specified with certain attributes. Despite being pretrained on large-scale image-text pairs with rich attribute information, their latent feature space does not highlight these fine-grained attributes. In this paper, we introduce HA-FGOVD, a universal and explicit method that enhances the attribute-level detection capabilities of frozen OVD models by highlighting fine-grained attributes in explicit linear space. Our approach uses a LLM to extract attribute words in input text as a zero-shot task. Then, token attention masks are adjusted to guide text encoders in extracting both global and attribute-specific features, which are explicitly composited as two vectors in linear space to form a new attribute-highlighted feature for detection tasks. The composition weight scalars can be learned or transferred across different OVD models, showcasing the universality of our method. Experimental results show that HA-FGOVD achieves state-of-the-art performance on the FG-OVD benchmark and demonstrates promising generalization on the OVDEval benchmark, suggesting that our method addresses significant limitations in fine-grained attribute detection and has potential for broader fine-grained detection applications.

查看原文本刊更多论文

HA-FGOVD：通过显式线性组合突出细粒度属性用于开放词汇对象检测

开放词汇目标检测（Open-vocabulary object detection， OVD）模型由于具有广泛的训练数据和大量的参数，被认为是大型多模态模型（Large Multi-modal model， LMM）。主流OVD模型优先考虑对象的粗粒度类别，而不是关注它们的细粒度属性，例如颜色或材料，因此无法识别具有特定属性的对象。尽管对具有丰富属性信息的大规模图像-文本对进行了预训练，但其潜在特征空间并不突出这些细粒度属性。本文介绍了一种通用的显式方法HA-FGOVD，该方法通过在显式线性空间中突出细粒度属性来增强冻结OVD模型的属性级检测能力。我们的方法使用LLM从输入文本中提取属性词作为零射击任务。然后，调整标记注意掩码以指导文本编码器提取全局特征和特定属性特征，这些特征在线性空间中显式合成为两个向量，形成新的属性突出显示特征用于检测任务。组合权重标量可以在不同的OVD模型之间学习或转移，显示了我们的方法的通用性。实验结果表明，HA-FGOVD在FG-OVD基准测试上达到了最先进的性能，并在OVDEval基准测试上表现出了很好的泛化效果，这表明我们的方法解决了细粒度属性检测的重大局限性，具有更广泛的细粒度检测应用潜力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Multimedia 工程技术-电信学

CiteScore

11.70

自引率

11.00%

发文量

576

审稿时长

5.5 months

期刊介绍： The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.