用于视频问题解答的变压器供电不变接地。

IF 18.6 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Pattern Analysis and Machine Intelligence Pub Date : 2023-08-09 DOI:10.1109/TPAMI.2023.3303451

Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, Tat-Seng Chua

{"title":"用于视频问题解答的变压器供电不变接地。","authors":"Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, Tat-Seng Chua","doi":"10.1109/TPAMI.2023.3303451","DOIUrl":null,"url":null,"abstract":"Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is the understanding of the alignments between video scenes and question semantics to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), tends to over-exploit the spurious correlations between question-irrelevant scenes and answers, instead of inspecting the causal effect of question-critical scenes, which undermines the prediction with unreliable reasoning. In this work, we take a causal look at VideoQA and propose a modal-agnostic learning framework, named Invariant Grounding for VideoQA (IGV), to ground the question-critical scene, whose causal relations with answers are invariant across different interventions on the complement. With IGV, leading VideoQA models are forced to shield the answering from the negative influence of spurious correlations, which significantly improves their reasoning ability. To unleash the potential of this framework, we further provide a Transformer-Empowered Invariant Grounding for VideoQA (TIGV), a substantial instantiation of IGV framework that naturally integrates the idea of invariant grounding into a transformer-style backbone. Experiments on four benchmark datasets validate our design in terms of accuracy, visual explainability, and generalization ability over the leading baselines. Our code is available at https://github.com/yl3800/TIGV.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6000,"publicationDate":"2023-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Transformer-Empowered Invariant Grounding for Video Question Answering.\",\"authors\":\"Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, Tat-Seng Chua\",\"doi\":\"10.1109/TPAMI.2023.3303451\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is the understanding of the alignments between video scenes and question semantics to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), tends to over-exploit the spurious correlations between question-irrelevant scenes and answers, instead of inspecting the causal effect of question-critical scenes, which undermines the prediction with unreliable reasoning. In this work, we take a causal look at VideoQA and propose a modal-agnostic learning framework, named Invariant Grounding for VideoQA (IGV), to ground the question-critical scene, whose causal relations with answers are invariant across different interventions on the complement. With IGV, leading VideoQA models are forced to shield the answering from the negative influence of spurious correlations, which significantly improves their reasoning ability. To unleash the potential of this framework, we further provide a Transformer-Empowered Invariant Grounding for VideoQA (TIGV), a substantial instantiation of IGV framework that naturally integrates the idea of invariant grounding into a transformer-style backbone. Experiments on four benchmark datasets validate our design in terms of accuracy, visual explainability, and generalization ability over the leading baselines. Our code is available at https://github.com/yl3800/TIGV.\",\"PeriodicalId\":13426,\"journal\":{\"name\":\"IEEE Transactions on Pattern Analysis and Machine Intelligence\",\"volume\":\"PP \",\"pages\":\"\"},\"PeriodicalIF\":18.6000,\"publicationDate\":\"2023-08-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Pattern Analysis and Machine Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1109/TPAMI.2023.3303451\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Pattern Analysis and Machine Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/TPAMI.2023.3303451","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

视频问题解答（VideoQA）是一项回答视频问题的任务。其核心是理解视频场景与问题语义之间的关联，从而得出答案。在主流的 VideoQA 模型中，典型的学习目标--经验风险最小化（ERM）--往往会过度利用与问题无关的场景和答案之间的虚假相关性，而不是检查问题关键场景的因果效应，从而以不可靠的推理破坏预测。在这项工作中，我们对 VideoQA 进行了因果分析，并提出了一种模式无关的学习框架，名为 "VideoQA 的不变基础"（Invariant Grounding for VideoQA，IGV），用于确定问题关键场景的基础，而问题关键场景与答案之间的因果关系在不同的补充干预中是不变的。有了 IGV，领先的视频质量保证模型就能使答案免受虚假相关性的负面影响，从而大大提高了推理能力。为了释放这一框架的潜力，我们进一步提供了变压器驱动的视频质量保证不变接地（TIGV），它是 IGV 框架的一个实质性实例化，自然地将不变接地的思想集成到了变压器式骨干网中。在四个基准数据集上进行的实验验证了我们的设计在准确性、视觉可解释性和泛化能力方面优于领先的基线。我们的代码见 https://github.com/yl3800/TIGV。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Transformer-Empowered Invariant Grounding for Video Question Answering.

Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is the understanding of the alignments between video scenes and question semantics to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), tends to over-exploit the spurious correlations between question-irrelevant scenes and answers, instead of inspecting the causal effect of question-critical scenes, which undermines the prediction with unreliable reasoning. In this work, we take a causal look at VideoQA and propose a modal-agnostic learning framework, named Invariant Grounding for VideoQA (IGV), to ground the question-critical scene, whose causal relations with answers are invariant across different interventions on the complement. With IGV, leading VideoQA models are forced to shield the answering from the negative influence of spurious correlations, which significantly improves their reasoning ability. To unleash the potential of this framework, we further provide a Transformer-Empowered Invariant Grounding for VideoQA (TIGV), a substantial instantiation of IGV framework that naturally integrates the idea of invariant grounding into a transformer-style backbone. Experiments on four benchmark datasets validate our design in terms of accuracy, visual explainability, and generalization ability over the leading baselines. Our code is available at https://github.com/yl3800/TIGV.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Pattern Analysis and Machine Intelligence 工程技术-工程：电子与电气

CiteScore

28.40

自引率

3.00%

发文量

885

审稿时长

8.5 months

期刊介绍： The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.