{"title":"零镜头视觉接地的错误感知生成推理","authors":"Yuqi Bu;Xin Wu;Yi Cai;Qiong Liu;Tao Wang;Qingbao Huang","doi":"10.1109/TMM.2025.3543062","DOIUrl":null,"url":null,"abstract":"Zero-shot visual grounding is the task of identifying and localizing an object in an image based on a referring expression without task-specific training. Existing methods employ heuristic rules to step-by-step perform visual perception for visual grounding. Despite their remarkable performance, there are still two limitations. First, such a rule-based manner struggles with expressions that are not covered by predefined rules. Second, existing methods lack a mechanism for identifying and correcting visual perceptual errors of incomplete information, resulting in cascading errors caused by reasoning based on incomplete visual perception results. In this article, we propose an Error-Aware Generative Reasoning (EAGR) method for zero-shot visual grounding. To address the limited adaptability of existing methods, a reasoning chain generator is presented, which prompts LLMs to dynamically generate reasoning chains for specific referring expressions. This generative manner eliminates the reliance on human-written heuristic rules. To mitigate visual perceptual errors of incomplete information, an error-aware mechanism is presented to elicit LLMs to identify these errors and explore correction strategies. Experimental results on four benchmarks show that EAGR outperforms state-of-the-art zero-shot methods by up to 10% and an average of 7%.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4844-4855"},"PeriodicalIF":9.7000,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Error-Aware Generative Reasoning for Zero-Shot Visual Grounding\",\"authors\":\"Yuqi Bu;Xin Wu;Yi Cai;Qiong Liu;Tao Wang;Qingbao Huang\",\"doi\":\"10.1109/TMM.2025.3543062\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Zero-shot visual grounding is the task of identifying and localizing an object in an image based on a referring expression without task-specific training. Existing methods employ heuristic rules to step-by-step perform visual perception for visual grounding. Despite their remarkable performance, there are still two limitations. First, such a rule-based manner struggles with expressions that are not covered by predefined rules. Second, existing methods lack a mechanism for identifying and correcting visual perceptual errors of incomplete information, resulting in cascading errors caused by reasoning based on incomplete visual perception results. In this article, we propose an Error-Aware Generative Reasoning (EAGR) method for zero-shot visual grounding. To address the limited adaptability of existing methods, a reasoning chain generator is presented, which prompts LLMs to dynamically generate reasoning chains for specific referring expressions. This generative manner eliminates the reliance on human-written heuristic rules. To mitigate visual perceptual errors of incomplete information, an error-aware mechanism is presented to elicit LLMs to identify these errors and explore correction strategies. 
Experimental results on four benchmarks show that EAGR outperforms state-of-the-art zero-shot methods by up to 10% and an average of 7%.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"4844-4855\"},\"PeriodicalIF\":9.7000,\"publicationDate\":\"2025-03-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10912743/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10912743/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Error-Aware Generative Reasoning for Zero-Shot Visual Grounding
Zero-shot visual grounding is the task of identifying and localizing an object in an image based on a referring expression, without task-specific training. Existing methods employ heuristic rules to perform visual perception step by step for visual grounding. Despite their remarkable performance, these methods have two limitations. First, such rule-based approaches struggle with expressions that are not covered by the predefined rules. Second, existing methods lack a mechanism for identifying and correcting visual perception errors caused by incomplete information, so reasoning over incomplete perception results leads to cascading errors. In this article, we propose an Error-Aware Generative Reasoning (EAGR) method for zero-shot visual grounding. To address the limited adaptability of existing methods, a reasoning chain generator is presented that prompts LLMs to dynamically generate reasoning chains for specific referring expressions. This generative approach eliminates the reliance on human-written heuristic rules. To mitigate perception errors caused by incomplete information, an error-aware mechanism is presented that prompts LLMs to identify these errors and explore correction strategies. Experimental results on four benchmarks show that EAGR outperforms state-of-the-art zero-shot methods by up to 10%, with an average improvement of 7%.
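The abstract describes a two-stage idea: an LLM first generates a reasoning chain for the referring expression, and an error-aware check then flags incomplete perception results and proposes corrections. The sketch below illustrates that control flow only; it is not the authors' implementation. The `call_llm` and `perceive` callables, the prompt wording, and the OK/correction protocol are all assumptions introduced for illustration.

```python
# Minimal sketch of the abstract's two stages: (1) prompt an LLM to produce a
# reasoning chain for a referring expression, (2) ask it to audit each perception
# result and issue a corrected instruction when information is incomplete.
# `call_llm` is a hypothetical stand-in for any text-completion API, and
# `perceive` for any detector/VLM query; both are assumptions, not the paper's API.
from typing import Callable, List


def generate_reasoning_chain(expression: str,
                             call_llm: Callable[[str], str]) -> List[str]:
    """Ask the LLM for step-by-step perception instructions, one per line."""
    prompt = (
        "Decompose the referring expression into an ordered list of visual "
        f"perception steps, one per line:\n{expression}"
    )
    return [s.strip() for s in call_llm(prompt).splitlines() if s.strip()]


def error_aware_check(step: str, perception_result: str,
                      call_llm: Callable[[str], str]) -> str:
    """Flag an incomplete perception result and ask for a correction strategy."""
    prompt = (
        f"Step: {step}\nPerception result: {perception_result}\n"
        "If the result is incomplete or inconsistent with the step, reply with "
        "a corrected instruction; otherwise reply OK."
    )
    verdict = call_llm(prompt).strip()
    return step if verdict == "OK" else verdict


def ground(expression: str,
           perceive: Callable[[str], str],
           call_llm: Callable[[str], str]) -> str:
    """Run the chain, re-perceiving whenever the error check proposes a fix."""
    result = ""
    for step in generate_reasoning_chain(expression, call_llm):
        result = perceive(step)
        revised = error_aware_check(step, result, call_llm)
        if revised != step:          # a correction strategy was issued
            result = perceive(revised)
    return result                    # final localized region description
```

Passing `call_llm` and `perceive` as callables keeps the sketch backend-agnostic: any completion API and any perception module (detector, captioner, VLM) can be plugged in without changing the error-aware loop.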
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.