{"title":"SceneLLM: LLM中用于动态场景图生成的隐式语言推理","authors":"Hang Zhang , Zhuoling Li , Jun Liu","doi":"10.1016/j.patcog.2025.111992","DOIUrl":null,"url":null,"abstract":"<div><div>Dynamic scenes contain intricate spatio-temporal information, crucial for mobile robots, UAVs, and autonomous driving systems to make informed decisions. Parsing these scenes into semantic triplets <span><math><mfenced><mrow></mrow></mfenced></math></span>Subject-Predicate-Object<span><math><mfenced><mrow></mrow></mfenced></math></span> for accurate Scene Graph Generation (SGG) is highly challenging due to the fluctuating spatio-temporal complexity. Inspired by the reasoning capabilities of Large Language Models (LLMs), we propose <em>SceneLLM</em>, a novel framework that leverages LLMs as powerful scene analyzers for dynamic SGG. Our framework introduces a Video-to-Language (V2L) mapping module that transforms video frames into linguistic signals (scene tokens), making the input more comprehensible for LLMs. To better encode spatial information, we devise a Spatial Information Aggregation (SIA) scheme, inspired by the structure of Chinese characters, which encodes spatial data into tokens. Using Optimal Transport (OT), we generate an implicit language signal from the frame-level token sequence that captures the video’s spatio-temporal information. To further improve the LLM’s ability to process this implicit linguistic input, we apply Low-Rank Adaptation (LoRA) to fine-tune the model. Finally, we use a transformer-based SGG predictor to decode the LLM’s reasoning and predict semantic triplets. Our method achieves state-of-the-art results on the Action Genome (AG) benchmark, and extensive experiments show the effectiveness of <em>SceneLLM</em> in understanding and generating accurate dynamic scene graphs.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"170 ","pages":"Article 111992"},"PeriodicalIF":7.5000,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SceneLLM: Implicit language reasoning in LLM for dynamic scene graph generation\",\"authors\":\"Hang Zhang , Zhuoling Li , Jun Liu\",\"doi\":\"10.1016/j.patcog.2025.111992\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Dynamic scenes contain intricate spatio-temporal information, crucial for mobile robots, UAVs, and autonomous driving systems to make informed decisions. Parsing these scenes into semantic triplets <span><math><mfenced><mrow></mrow></mfenced></math></span>Subject-Predicate-Object<span><math><mfenced><mrow></mrow></mfenced></math></span> for accurate Scene Graph Generation (SGG) is highly challenging due to the fluctuating spatio-temporal complexity. Inspired by the reasoning capabilities of Large Language Models (LLMs), we propose <em>SceneLLM</em>, a novel framework that leverages LLMs as powerful scene analyzers for dynamic SGG. Our framework introduces a Video-to-Language (V2L) mapping module that transforms video frames into linguistic signals (scene tokens), making the input more comprehensible for LLMs. To better encode spatial information, we devise a Spatial Information Aggregation (SIA) scheme, inspired by the structure of Chinese characters, which encodes spatial data into tokens. Using Optimal Transport (OT), we generate an implicit language signal from the frame-level token sequence that captures the video’s spatio-temporal information. 
To further improve the LLM’s ability to process this implicit linguistic input, we apply Low-Rank Adaptation (LoRA) to fine-tune the model. Finally, we use a transformer-based SGG predictor to decode the LLM’s reasoning and predict semantic triplets. Our method achieves state-of-the-art results on the Action Genome (AG) benchmark, and extensive experiments show the effectiveness of <em>SceneLLM</em> in understanding and generating accurate dynamic scene graphs.</div></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":\"170 \",\"pages\":\"Article 111992\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2025-06-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320325006521\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325006521","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
SceneLLM: Implicit language reasoning in LLM for dynamic scene graph generation
Dynamic scenes contain intricate spatio-temporal information, crucial for mobile robots, UAVs, and autonomous driving systems to make informed decisions. Parsing these scenes into semantic triplets ⟨Subject-Predicate-Object⟩ for accurate Scene Graph Generation (SGG) is highly challenging due to the fluctuating spatio-temporal complexity. Inspired by the reasoning capabilities of Large Language Models (LLMs), we propose SceneLLM, a novel framework that leverages LLMs as powerful scene analyzers for dynamic SGG. Our framework introduces a Video-to-Language (V2L) mapping module that transforms video frames into linguistic signals (scene tokens), making the input more comprehensible for LLMs. To better encode spatial information, we devise a Spatial Information Aggregation (SIA) scheme, inspired by the structure of Chinese characters, which encodes spatial data into tokens. Using Optimal Transport (OT), we generate an implicit language signal from the frame-level token sequence that captures the video’s spatio-temporal information. To further improve the LLM’s ability to process this implicit linguistic input, we apply Low-Rank Adaptation (LoRA) to fine-tune the model. Finally, we use a transformer-based SGG predictor to decode the LLM’s reasoning and predict semantic triplets. Our method achieves state-of-the-art results on the Action Genome (AG) benchmark, and extensive experiments show the effectiveness of SceneLLM in understanding and generating accurate dynamic scene graphs.
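The abstract names several concrete mechanisms: V2L mapping of frames to scene tokens, OT aggregation of the frame-level token sequence into an implicit language signal, and LoRA fine-tuning of the LLM. The sketch below illustrates these three stages in PyTorch. It is a minimal reconstruction from the abstract alone: every class name, dimension, and the choice of an entropic Sinkhorn solver for the OT step are assumptions, not the authors' released implementation; the LLM backbone, the SIA scheme, and the transformer-based SGG predictor are omitted.

```python
# Hypothetical sketch of three pipeline stages named in the abstract.
# All names, dimensions, and the Sinkhorn OT solver are illustrative assumptions.
import torch
import torch.nn as nn

class V2LMapper(nn.Module):
    """Video-to-Language: project per-frame visual features into 'scene tokens'."""
    def __init__(self, vis_dim=2048, tok_dim=768, n_tokens=16):
        super().__init__()
        self.proj = nn.Linear(vis_dim, n_tokens * tok_dim)
        self.n_tokens, self.tok_dim = n_tokens, tok_dim

    def forward(self, frame_feats):                      # (B, T, vis_dim)
        B, T, _ = frame_feats.shape
        return self.proj(frame_feats).view(B, T * self.n_tokens, self.tok_dim)

def sinkhorn_ot(cost, eps=0.1, n_iters=50):
    """Entropic OT plan between two token sets with uniform marginals.
    `cost` should be normalized (e.g. to [0, 1]) to avoid underflow in exp."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)
    u = torch.full((n,), 1.0 / n)
    v = torch.full((m,), 1.0 / m)
    for _ in range(n_iters):                             # alternating scaling updates
        u = (1.0 / n) / (K @ v)
        v = (1.0 / m) / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)           # (n, m) transport plan

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: identity at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

if __name__ == "__main__":
    tokens = V2LMapper()(torch.randn(1, 4, 2048))        # (1, 64, 768) scene tokens
    cost = torch.cdist(tokens[0], tokens[0])             # toy pairwise token cost
    plan = sinkhorn_ot(cost / cost.max())                # implicit-language alignment
    adapted = LoRALinear(nn.Linear(768, 768))(tokens)    # LoRA-adapted LLM layer
    print(plan.shape, adapted.shape)                     # (64, 64) and (1, 64, 768)
```

Note the standard LoRA design choice reflected here: the low-rank factor B is zero-initialized, so the adapted layer starts out identical to the frozen base layer and the update is learned gradually during fine-tuning.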
Journal Introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.