Knowledge Integration for Grounded Situation Recognition

IF 7.5 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Recognition | Pub Date: 2025-05-05 | DOI: 10.1016/j.patcog.2025.111766

Jiaming Lei, Sijing Wu, Lin Li, Lei Chen, Jun Xiao, Yi Yang, Long Chen

Volume 167, Article 111766. Publisher page: https://www.sciencedirect.com/science/article/pii/S0031320325004261

Citations: 0

Abstract


Knowledge Integration for Grounded Situation Recognition

Grounded Situation Recognition (GSR) involves interpreting complex events in images by identifying key verbs (e.g., sketching), detecting related semantic roles (e.g., AGENT is man), and localizing noun entities with bounding boxes. Because verbs and noun entities are semantically correlated, existing methods predominantly leverage these correlations to refine verb predictions using noun entities, or vice versa. However, these approaches often disregard the long-tailed distributions inherent in the training dataset, resulting in biased predictions and poor accuracy on less frequent noun entities and verbs. To tackle this issue, we introduce a novel KnOwledge Integration (KOI) strategy that alleviates the bias by merging two types of knowledge: general knowledge and GSR-specific downstream knowledge. Specifically, the integration employs vision-language models (VLMs), e.g., CLIP, to extract expansive, contextual general knowledge, which is potentially beneficial for tail category recognition, and harnesses pre-trained GSR models for detailed, domain-focused downstream knowledge, which is typically advantageous for head category recognition. To bridge the gap between general and specific knowledge, we devise a trade-off weighting strategy that merges these two sources, ensuring a robust prediction that is not extremely biased towards either head or tail categories. KOI's model-agnostic nature facilitates its integration into various GSR frameworks, demonstrating its generality. Extensive experimental results on the SWiG dataset demonstrate that KOI significantly outperforms existing methods, establishing new state-of-the-art performance across multiple metrics.
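
The abstract does not give the form of the trade-off weighting, so the following is only a minimal sketch of the general idea, assuming the simplest possible choice: a convex combination of two probability distributions over the same verb vocabulary. The function names, the weight `alpha`, and the toy scores are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(general_probs, downstream_probs, alpha):
    """Convex combination of two distributions over the same label set.

    alpha is a hypothetical trade-off weight in [0, 1]:
    alpha = 0 keeps only the downstream (GSR model) prediction,
    alpha = 1 keeps only the general (VLM) prediction.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if len(general_probs) != len(downstream_probs):
        raise ValueError("both distributions must cover the same labels")
    return [alpha * g + (1.0 - alpha) * d
            for g, d in zip(general_probs, downstream_probs)]


# Toy verb vocabulary: one frequent (head) verb and one rare (tail) verb.
verbs = ["sitting", "sketching"]
# Made-up scores standing in for a GSR model that favors the head verb.
downstream = softmax([2.0, 0.0])
# Made-up scores standing in for VLM (e.g., CLIP) image-text similarities.
general = softmax([0.0, 1.0])

fused = fuse_predictions(general, downstream, alpha=0.5)
predicted_verb = verbs[fused.index(max(fused))]
```

Raising `alpha` shifts the fused prediction toward the general-knowledge source and lowering it shifts toward the downstream model; how KOI actually sets or learns this balance is described in the full paper, not here.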


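Source journal

Pattern Recognition (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Articles published: 683

Review time: 5.6 months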

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.