LangLoc: Language-Driven Localization via Formatted Spatial Description Generation

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society. Pub Date: 2025-03-11. DOI: 10.1109/TIP.2025.3546853

Weimin Shi; Changhao Chen; Kaige Li; Yuan Xiong; Xiaochun Cao; Zhong Zhou

Volume 34, pages 1737-1752. Publication type: Journal Article. URL: https://ieeexplore.ieee.org/document/10923622/


Abstract

Existing localization methods commonly employ vision to perceive the scene and achieve localization in GNSS-denied areas, yet they often struggle in environments with complex lighting conditions, dynamic objects, or privacy-preserving areas. Humans can describe various scenes using natural language and effectively infer their location by leveraging the rich semantic information in these descriptions. Harnessing language therefore presents a potential solution for robust localization. This study introduces a new task, Language-driven Localization, and proposes a novel localization framework, LangLoc, which determines the user’s position and orientation from textual descriptions. Given the diversity of natural language descriptions, we first design a Spatial Description Generator (SDG), foundational to LangLoc, which extracts and combines the position and attribute information of objects within a scene to generate uniformly formatted textual descriptions. SDG eliminates the ambiguity of language by detailing the spatial layout and object relations of the scene, providing a reliable basis for localization. With the generated descriptions, LangLoc achieves language-only localization using a text encoder and a pose regressor. Furthermore, LangLoc can add one image to the text input, achieving mutual optimization and adaptive feature fusion across modalities through two modality-specific encoders, cross-modal fusion, and multimodal joint learning strategies. This enhances the framework’s capability to handle complex scenes and yields more accurate localization. Extensive experiments on the Oxford RobotCar, 4-Seasons, and Virtual Gallery datasets demonstrate LangLoc’s effectiveness in both language-only and visual-language localization across various outdoor and indoor scenarios. Notably, LangLoc achieves noticeable performance gains when using both text and image inputs in challenging conditions such as overexposure, low lighting, and occlusions, showcasing its robustness.
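
The abstract does not give the actual description format produced by SDG. As a rough illustration only, the idea of turning object positions and attributes into a uniformly formatted description could be sketched as below. The `SceneObject` fields, the 3x3 region grid, and the sentence template are all assumptions made for this sketch, not the authors' method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneObject:
    """Hypothetical detected object; field names are assumptions for this sketch."""
    label: str       # object category, e.g. "building"
    attribute: str   # a single attribute, e.g. "red"
    x_center: float  # normalized horizontal position in [0, 1]
    y_center: float  # normalized vertical position in [0, 1]


def horizontal_region(x: float) -> str:
    # Split the image width into three equal bands (an assumed discretization).
    if x < 1 / 3:
        return "left"
    if x < 2 / 3:
        return "center"
    return "right"


def vertical_region(y: float) -> str:
    # Split the image height into three equal bands (an assumed discretization).
    if y < 1 / 3:
        return "top"
    if y < 2 / 3:
        return "middle"
    return "bottom"


def describe_scene(objects: List[SceneObject]) -> str:
    """Build one fixed-template sentence so equal scenes give equal text."""
    if not objects:
        return "The scene contains no detected objects."
    # Sort deterministically so the description does not depend on input order.
    ordered = sorted(objects, key=lambda o: (o.x_center, o.y_center, o.label))
    parts = [
        f"a {o.attribute} {o.label} at the "
        f"{vertical_region(o.y_center)} {horizontal_region(o.x_center)}"
        for o in ordered
    ]
    return "The scene contains " + "; ".join(parts) + "."


if __name__ == "__main__":
    scene = [
        SceneObject("car", "white", 0.8, 0.7),
        SceneObject("building", "red", 0.1, 0.2),
    ]
    print(describe_scene(scene))
```

Because the template and ordering are fixed, two detections of the same layout produce identical text, which is the property the abstract attributes to uniformly formatted descriptions. In the framework described, such text would then be passed to a text encoder and a pose regressor, neither of which is sketched here.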



