Hierarchical Multimodal Knowledge Matching for Training-Free Open-Vocabulary Object Detection

Qisen Ma, Yan Huang, Zikun Liu, Hyunhee Park, Liang Wang

IEEE Transactions on Image Processing, published 2025-10-14. DOI: 10.1109/tip.2025.3618408
Open-Vocabulary Object Detection (OVOD) aims to leverage the generalization capabilities of pre-trained vision-language models to detect objects beyond the trained categories. Existing methods mostly focus on supervised learning strategies based on available training data, which can be suboptimal for data-limited novel categories. To tackle this challenge, this paper presents a Hierarchical Multimodal Knowledge Matching method (HMKM) to better represent novel categories and match them with region features. Specifically, HMKM includes a set of object prototype knowledge, obtained from a limited number of category-specific images, that acts as off-the-shelf category representations. In addition, HMKM includes a set of attribute prototype knowledge that represents key attributes of categories at a fine-grained level, with the goal of distinguishing a category from visually similar ones. During inference, the two sets of object and attribute prototype knowledge are adaptively combined to match categories with region features. The proposed HMKM is training-free and can be easily integrated as a plug-and-play module into existing OVOD models. Extensive experiments demonstrate that HMKM significantly improves performance when detecting novel categories across various backbones and datasets.
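The abstract does not give the exact fusion rule, but the general idea of combining object-level and attribute-level prototype similarities can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the cosine-similarity scoring, the mean-pooling of attribute prototypes, and the fixed blending weight `alpha` are hypothetical choices, not the paper's actual method.

```python
import numpy as np

def l2norm(x, axis=-1):
    """L2-normalize vectors so that dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def match_regions(region_feats, obj_protos, attr_protos, alpha=0.5):
    """Score each region against each category by blending object-prototype
    and attribute-prototype similarities (hypothetical weighting scheme).

    region_feats: (R, D) region features from the detector
    obj_protos:   (C, D) one object prototype per category
    attr_protos:  (C, K, D) K attribute prototypes per category
    """
    r = l2norm(region_feats)                 # (R, D)
    o = l2norm(obj_protos)                   # (C, D)
    a = l2norm(attr_protos.mean(axis=1))     # pool K attributes -> (C, D)
    obj_sim = r @ o.T                        # (R, C) cosine similarity
    attr_sim = r @ a.T                       # (R, C)
    # Fixed linear blend; the paper's adaptive combination would replace this.
    return alpha * obj_sim + (1 - alpha) * attr_sim

# Toy usage: 3 categories, 8-dim features, regions aligned with prototypes.
scores = match_regions(np.eye(3, 8), np.eye(3, 8), np.ones((3, 4, 8)))
print(scores.argmax(axis=1))  # each region matches its own category
```

In this toy setup each region feature equals its category's object prototype, so the blended score is maximized on the diagonal regardless of the (uninformative) attribute term.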
About the journal:
The IEEE Transactions on Image Processing covers novel theories, algorithms, and architectures for the generation, acquisition, manipulation, transmission, analysis, and presentation of images, video, and multidimensional signals across diverse applications. Topics span mathematical, statistical, and perceptual aspects, including the modeling, representation, formation, coding, filtering, enhancement, restoration, rendering, halftoning, search, and analysis of images, video, and multidimensional signals. Relevant applications range from image and video communications to electronic imaging, biomedical imaging, image and video systems, and remote sensing.