{"title":"基于时空显著性的无偏对比学习视频场景图生成","authors":"Weijun Zhuang;Bowen Dong;Zhilin Zhu;Zhijun Li;Jie Liu;Yaowei Wang;Xiaopeng Hong;Xin Li;Wangmeng Zuo","doi":"10.1109/TMM.2025.3557688","DOIUrl":null,"url":null,"abstract":"Accurately detecting objects and their interrelationships for Video Scene Graph Generation (VidSGG) confronts two primary challenges. The first involves the identification of active objects interacting with humans from the numerous background objects, while the second challenge is long-tailed distribution among predicate classes. To tackle these challenges, we propose STABILE, a novel framework with a spatial-temporal saliency-guided contrastive learning scheme. For the first challenge, STABILE features an active object retriever that includes an object saliency fusion block for enhancing object embeddings with motion cues alongside an object temporal encoder to capture temporal dependencies. For the second challenge, STABILE introduces an unbiased relationship representation learning module with an Unbiased Multi-Label (UML) contrastive loss to mitigate the effect of long-tailed distribution. With the enhancements in both aspects, STABILE substantially boosts the accuracy of scene graph generation. Extensive experiments demonstrate the superiority of STABILE, setting new benchmarks in the field by offering enhanced accuracy and unbiased scene graph generation.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"3092-3104"},"PeriodicalIF":9.7000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Spatial-Temporal Saliency Guided Unbiased Contrastive Learning for Video Scene Graph Generation\",\"authors\":\"Weijun Zhuang;Bowen Dong;Zhilin Zhu;Zhijun Li;Jie Liu;Yaowei Wang;Xiaopeng Hong;Xin Li;Wangmeng Zuo\",\"doi\":\"10.1109/TMM.2025.3557688\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Accurately detecting objects and their interrelationships for Video Scene Graph Generation (VidSGG) confronts two primary challenges. The first involves the identification of active objects interacting with humans from the numerous background objects, while the second challenge is long-tailed distribution among predicate classes. To tackle these challenges, we propose STABILE, a novel framework with a spatial-temporal saliency-guided contrastive learning scheme. For the first challenge, STABILE features an active object retriever that includes an object saliency fusion block for enhancing object embeddings with motion cues alongside an object temporal encoder to capture temporal dependencies. For the second challenge, STABILE introduces an unbiased relationship representation learning module with an Unbiased Multi-Label (UML) contrastive loss to mitigate the effect of long-tailed distribution. With the enhancements in both aspects, STABILE substantially boosts the accuracy of scene graph generation. 
Extensive experiments demonstrate the superiority of STABILE, setting new benchmarks in the field by offering enhanced accuracy and unbiased scene graph generation.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"3092-3104\"},\"PeriodicalIF\":9.7000,\"publicationDate\":\"2025-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10948366/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10948366/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Abstract: Accurately detecting objects and their interrelationships for Video Scene Graph Generation (VidSGG) confronts two primary challenges. The first is identifying the active objects that interact with humans among the many background objects; the second is the long-tailed distribution of predicate classes. To tackle these challenges, we propose STABILE, a novel framework with a spatial-temporal saliency-guided contrastive learning scheme. For the first challenge, STABILE features an active object retriever comprising an object saliency fusion block, which enhances object embeddings with motion cues, and an object temporal encoder, which captures temporal dependencies. For the second challenge, STABILE introduces an unbiased relationship representation learning module with an Unbiased Multi-Label (UML) contrastive loss to mitigate the effect of the long-tailed distribution. With enhancements in both aspects, STABILE substantially boosts the accuracy of scene graph generation. Extensive experiments demonstrate the superiority of STABILE, which sets new benchmarks in the field with more accurate and unbiased scene graph generation.
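The abstract names the two components of the active object retriever but does not specify their design here. The following is a minimal PyTorch sketch of one plausible reading: a saliency fusion block that gates motion cues into per-frame object embeddings, and a temporal encoder over each object's track. All class names, dimensions, and the residual gating design are assumptions made for illustration, not the paper's actual architecture.

```python
# Hypothetical sketch of the two components named in the abstract; the module
# names, feature dimensions, and gating design are assumptions, not the
# paper's specification.
import torch
import torch.nn as nn

class ObjectSaliencyFusion(nn.Module):
    """Fuse object appearance embeddings with motion-saliency features."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, obj_emb: torch.Tensor, motion_emb: torch.Tensor) -> torch.Tensor:
        # obj_emb, motion_emb: (num_objects, dim)
        g = self.gate(torch.cat([obj_emb, motion_emb], dim=-1))
        # Residual update: motion cues enter only where saliency gate opens.
        return obj_emb + g * self.proj(motion_emb)

class ObjectTemporalEncoder(nn.Module):
    """Capture temporal dependencies across an object's per-frame embeddings."""
    def __init__(self, dim: int = 256, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, track_emb: torch.Tensor) -> torch.Tensor:
        # track_emb: (num_tracks, num_frames, dim)
        return self.encoder(track_emb)
```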
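Likewise, the UML contrastive loss is only named, not defined, in this listing. Below is a hedged sketch of one way such a loss could work: a multi-label supervised contrastive loss in which relationship pairs sharing at least one predicate label are positives, with each anchor reweighted by the inverse frequency of its rarest predicate so that tail classes are not drowned out by head classes. The function name uml_contrastive_loss, the class_freq input, and the weighting scheme are all assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def uml_contrastive_loss(feats: torch.Tensor, labels: torch.Tensor,
                         class_freq: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Hypothetical multi-label contrastive loss with inverse-frequency reweighting.

    feats:      (N, D) relationship embeddings
    labels:     (N, C) multi-hot predicate labels
    class_freq: (C,)   training-set count of each predicate class
    """
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / temperature
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()  # numerical stability

    # Positives: pairs sharing at least one predicate label (self excluded).
    shared = (labels.float() @ labels.float().t()) > 0
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos_mask = (shared & ~eye).float()

    # Per-anchor weight: inverse frequency of the rarest predicate it carries,
    # so tail-class anchors dominate the average (assumed debiasing scheme).
    inv_freq = 1.0 / class_freq.float().clamp(min=1.0)
    w = (labels.float() * inv_freq).max(dim=1).values

    exp_logits = torch.exp(logits).masked_fill(eye, 0.0)
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-12)

    pos_cnt = pos_mask.sum(dim=1).clamp(min=1.0)
    loss_i = -(log_prob * pos_mask).sum(dim=1) / pos_cnt   # per-anchor SupCon term
    valid = (pos_mask.sum(dim=1) > 0).float()              # anchors with >= 1 positive
    return (w * loss_i * valid).sum() / (w * valid).sum().clamp(min=1e-12)
```

In practice a loss of this form would be applied to the relationship embeddings produced by the retriever above, alongside the standard predicate classification loss.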
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.