{"title":"HA-FGOVD: Highlighting Fine-Grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection","authors":"Yuqi Ma;Mengyin Liu;Chao Zhu;Xu-Cheng Yin","doi":"10.1109/TMM.2025.3557624","DOIUrl":null,"url":null,"abstract":"Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal Models (LMM), due to their extensive training data and a large number of parameters. Mainstream OVD models prioritize object coarse-grained category rather than focus on their fine-grained attributes, e.g., colors or materials, thus failed to identify objects specified with certain attributes. Despite being pretrained on large-scale image-text pairs with rich attribute information, their latent feature space does not highlight these fine-grained attributes. In this paper, we introduce HA-FGOVD, a universal and explicit method that enhances the attribute-level detection capabilities of frozen OVD models by highlighting fine-grained attributes in explicit linear space. Our approach uses a LLM to extract attribute words in input text as a zero-shot task. Then, token attention masks are adjusted to guide text encoders in extracting both global and attribute-specific features, which are explicitly composited as two vectors in linear space to form a new attribute-highlighted feature for detection tasks. The composition weight scalars can be learned or transferred across different OVD models, showcasing the universality of our method. Experimental results show that HA-FGOVD achieves state-of-the-art performance on the FG-OVD benchmark and demonstrates promising generalization on the OVDEval benchmark, suggesting that our method addresses significant limitations in fine-grained attribute detection and has potential for broader fine-grained detection applications.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3171-3183"},"PeriodicalIF":8.4000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10948367/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal Models (LMM), due to their extensive training data and a large number of parameters. Mainstream OVD models prioritize object coarse-grained category rather than focus on their fine-grained attributes, e.g., colors or materials, thus failed to identify objects specified with certain attributes. Despite being pretrained on large-scale image-text pairs with rich attribute information, their latent feature space does not highlight these fine-grained attributes. In this paper, we introduce HA-FGOVD, a universal and explicit method that enhances the attribute-level detection capabilities of frozen OVD models by highlighting fine-grained attributes in explicit linear space. Our approach uses a LLM to extract attribute words in input text as a zero-shot task. Then, token attention masks are adjusted to guide text encoders in extracting both global and attribute-specific features, which are explicitly composited as two vectors in linear space to form a new attribute-highlighted feature for detection tasks. The composition weight scalars can be learned or transferred across different OVD models, showcasing the universality of our method. Experimental results show that HA-FGOVD achieves state-of-the-art performance on the FG-OVD benchmark and demonstrates promising generalization on the OVDEval benchmark, suggesting that our method addresses significant limitations in fine-grained attribute detection and has potential for broader fine-grained detection applications.
期刊介绍:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.