Jiaming Lei, Sijing Wu, Lin Li, Lei Chen, Jun Xiao, Yi Yang, Long Chen
Pattern Recognition, Volume 167, Article 111766. Published 2025-05-05. DOI: 10.1016/j.patcog.2025.111766. URL: https://www.sciencedirect.com/science/article/pii/S0031320325004261
Knowledge Integration for Grounded Situation Recognition
Grounded Situation Recognition (GSR) interprets complex events in images by identifying key verbs (e.g., sketching), detecting the related semantic roles (e.g., AGENT is man), and localizing noun entities with bounding boxes. Because verbs and noun entities are inherently semantically correlated, existing methods predominantly leverage these correlations to refine verb predictions using noun entities, or vice versa. However, these approaches often disregard the long-tailed distribution inherent in the training dataset, resulting in biased predictions and poor accuracy on less frequent noun entities and verbs. To tackle this issue, we introduce a novel KnOwledge Integration (KOI) strategy that alleviates the bias by merging two complementary types of knowledge: general knowledge and GSR-specific downstream knowledge. Specifically, KOI employs vision-language models (VLMs), e.g., CLIP, to extract expansive, contextual general knowledge that is potentially beneficial for tail-category recognition, and harnesses pre-trained GSR models for detailed, domain-focused downstream knowledge that is typically advantageous for head-category recognition. To bridge the gap between the general and the specific, we devise a trade-off weighting strategy that effectively merges these diverse insights, ensuring robust predictions that are not heavily biased towards either head or tail categories. KOI's model-agnostic nature allows it to be integrated into various GSR frameworks, demonstrating its generality. Extensive experimental results on the SWiG dataset show that KOI significantly outperforms existing methods, establishing new state-of-the-art performance across multiple metrics.
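To make the trade-off weighting idea concrete, the following is a minimal sketch of how a prediction from a general-knowledge model (e.g., CLIP) and a prediction from a pre-trained GSR model could be fused with a single trade-off weight. This is an illustrative convex combination under assumed inputs (raw per-class logits from each model); the paper's actual weighting scheme is its contribution and is not reproduced here, and the names `softmax` and `koi_combine` are hypothetical helpers, not the authors' API.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def koi_combine(vlm_logits, gsr_logits, alpha=0.5):
    """Hypothetical trade-off fusion: a convex combination of the
    general-knowledge (VLM) and downstream (GSR) distributions.

    alpha near 1 trusts the VLM more (tends to help tail categories);
    alpha near 0 trusts the GSR model more (tends to help head categories).
    """
    p_vlm = softmax(vlm_logits)
    p_gsr = softmax(gsr_logits)
    return [alpha * pv + (1 - alpha) * pg for pv, pg in zip(p_vlm, p_gsr)]

# Toy example with 3 verb classes: the GSR model is confident in class 0
# (a head category), while the VLM puts most mass on class 2 (a tail
# category). Fusion keeps a valid distribution that hedges between them.
fused = koi_combine([0.1, 0.2, 2.0], [3.0, 0.5, 0.2], alpha=0.5)
```

The fused scores remain a valid probability distribution, and the tail class receives more mass than the GSR model alone would assign, which is the qualitative behavior the abstract describes.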
Journal introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.