Grouped top-down reasoning with hierarchical window transformer for visual grounding
Liuwu Li, Zhuoming Zheng, Yuqi Bu, Cantao Wu, Shubin Huang, Qingbao Huang, Yi Cai
Information Processing & Management, volume 62, issue 6, Article 104222. Published 2025-06-05.
DOI: 10.1016/j.ipm.2025.104222
URL: https://www.sciencedirect.com/science/article/pii/S0306457325001633
Impact factor: 6.9. JCR: Q1 (Computer Science, Information Systems).
Citations: 0
Abstract
Visual grounding, which localizes objects in images based on natural language descriptions, requires effective processing of multi-scale visual inputs to capture both fine-grained details and global context for complex and diverse scenes. However, existing transformer-based methods face significant challenges when handling such inputs, including computational complexity that scales quadratically with spatial dimensions and difficulties in effectively aligning cross-scale information. To address these limitations, we propose Grouped Top-Down Reasoning with Hierarchical Window Transformer (GTD-HWT) with two key innovations: (1) a multi-scale input reconstruction strategy that partitions and reconstructs multi-scale inputs into hierarchically structured shorter sequences, effectively preserving both coarse- and fine-grained information while reducing computational costs, and (2) a dual multi-head attention mechanism that enables semantic reasoning through parallel inter-window attention for coarse-grained understanding and subsequent intra-window attention for fine-grained refinement guided by coarse-grained priors. Extensive experiments on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks demonstrate that our method achieves significant improvements over state-of-the-art approaches in both referring expression comprehension and segmentation tasks.
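To make the cost argument in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes non-overlapping square windows and a single summary token per window for the inter-window step; both choices, and the helper names `partition_windows`, `attention_pairs_full`, and `attention_pairs_two_level`, are hypothetical and used only for illustration. It shows how splitting one long token sequence into shorter windowed sequences reduces the number of query-key pairs that attention has to score.

```python
from typing import List


def partition_windows(grid: List[List[int]], win: int) -> List[List[int]]:
    """Split an H x W grid into non-overlapping win x win windows.

    Windows are returned in row-major order, each flattened to a list.
    Assumption: H and W are divisible by win (no padding is handled here).
    """
    h, w = len(grid), len(grid[0])
    if h % win != 0 or w % win != 0:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, h, win):
        for left in range(0, w, win):
            windows.append(
                [grid[r][c]
                 for r in range(top, top + win)
                 for c in range(left, left + win)]
            )
    return windows


def attention_pairs_full(n_tokens: int) -> int:
    """Query-key pairs scored by global self-attention over all tokens."""
    return n_tokens * n_tokens


def attention_pairs_two_level(n_tokens: int, win: int) -> int:
    """Query-key pairs for a coarse inter-window step plus a fine intra-window step.

    Assumption: each window contributes one summary token to the
    inter-window step; the paper may summarize windows differently.
    """
    tokens_per_window = win * win
    n_windows = n_tokens // tokens_per_window
    inter = n_windows * n_windows               # coarse: window summaries attend to each other
    intra = n_windows * tokens_per_window ** 2  # fine: tokens attend within their own window
    return inter + intra


# A 4 x 4 grid of token ids, split into 2 x 2 windows.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = partition_windows(grid, win=2)

# A 16 x 16 feature map (256 tokens) with 4 x 4 windows.
full_cost = attention_pairs_full(16 * 16)
two_level_cost = attention_pairs_two_level(16 * 16, win=4)
```

Under these assumptions the 16 x 16 example scores 65,536 pairs with global attention and 4,352 pairs with the two-level scheme, which illustrates why shorter hierarchical sequences lower the quadratic cost. The sketch only counts pairs; it does not model how coarse-grained priors guide the fine-grained step.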
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.