Temporal-Guided Mixture-of-Experts for Zero-Shot Video Question Answering

Impact Factor 11.1 · CAS Tier 1 (Engineering & Technology) · JCR Q1, Engineering, Electrical & Electronic
Yixin Qin, Lei Zhao, Lianli Gao, Haonan Zhang, Pengpeng Zeng, Heng Tao Shen
{"title":"Temporal-Guided Mixture-of-Experts for Zero-Shot Video Question Answering","authors":"Yixin Qin;Lei Zhao;Lianli Gao;Haonan Zhang;Pengpeng Zeng;Heng Tao Shen","doi":"10.1109/TCSVT.2025.3556422","DOIUrl":null,"url":null,"abstract":"Video Question Answering (VideoQA) is a challenging task in the vision-language field. Due to the time-consuming and labor-intensive labeling process of the question-answer pairs, fully supervised methods are no longer suitable for the current increasing demand for data. This has led to the rise of zero-shot VideoQA, and some works propose to adapt large language models (LLMs) to assist zero-shot learning. Despite recent progress, the inadequacy of LLMs in comprehending temporal information in videos and the neglect of temporal differences, e.g., the different dynamic changes between scenes or objects, remain insufficiently addressed by existing attempts in zero-shot VideoQA. In light of these challenges, a novel Temporal-guided Mixture-of-Experts Network (T-MoENet) for zero-shot video question answering is proposed in this paper. Specifically, we apply a temporal module to imbue language models with the capacity to perceive temporal information. Then a temporal-guided mixture-of-experts module is proposed to further learn the temporal differences presented in different videos. It enables the model to effectively improve the capacity of generalization. Our proposed method achieves state-of-the-art performance on multiple zero-shot VideoQA benchmarks, notably improving accuracy by 5.6% on TGIF-FrameQA and 2.3% on MSRVTT-QA while remaining competitive with other methods in the fully supervised setting. The codes and models developed in this study will be made publicly available at <uri>https://github.com/qyx1121/T-MoENet</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9003-9016"},"PeriodicalIF":11.1000,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10946169/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Video Question Answering (VideoQA) is a challenging task in the vision-language field. Because labeling question-answer pairs is time-consuming and labor-intensive, fully supervised methods cannot keep pace with the growing demand for data. This has driven the rise of zero-shot VideoQA, and several works adapt large language models (LLMs) to assist zero-shot learning. Despite recent progress, existing attempts at zero-shot VideoQA still do not adequately address two issues: LLMs comprehend temporal information in videos poorly, and temporal differences, e.g., the distinct dynamic changes between scenes or objects, are largely neglected. In light of these challenges, this paper proposes a novel Temporal-guided Mixture-of-Experts Network (T-MoENet) for zero-shot VideoQA. Specifically, we apply a temporal module to imbue language models with the capacity to perceive temporal information. A temporal-guided mixture-of-experts module is then proposed to further learn the temporal differences presented across different videos, which effectively improves the model's generalization ability. Our method achieves state-of-the-art performance on multiple zero-shot VideoQA benchmarks, improving accuracy by 5.6% on TGIF-FrameQA and 2.3% on MSRVTT-QA, while remaining competitive with other methods in the fully supervised setting. The code and models developed in this study will be made publicly available at https://github.com/qyx1121/T-MoENet.
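The abstract names two components: a temporal module that gives the language model temporal perception, and a temporal-guided mixture-of-experts (MoE) module that adapts to the dynamics of each clip. Below is a minimal sketch of the second idea only, assuming a standard top-k MoE whose gate is additionally conditioned on a pooled temporal summary of the video tokens; the class name `TemporalGuidedMoE`, the pooling choice, and all hyperparameters are illustrative assumptions and do not reproduce the authors' T-MoENet implementation.

```python
# Illustrative sketch: a generic top-k mixture-of-experts layer whose routing is
# conditioned on a clip-level temporal summary. Hypothetical names; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalGuidedMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small position-wise feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        # The gate sees each token together with a temporal summary of the whole clip,
        # so routing can reflect clip-level dynamics ("temporal guidance").
        self.gate = nn.Linear(2 * dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) frame-level video tokens.
        summary = x.mean(dim=1, keepdim=True).expand_as(x)           # clip-level context
        logits = self.gate(torch.cat([x, summary], dim=-1))          # (B, T, E)
        weights, indices = logits.topk(self.top_k, dim=-1)           # (B, T, k)
        weights = F.softmax(weights, dim=-1)

        # Dense dispatch (simple, not efficient): run every expert, then pick per token.
        expert_outs = torch.stack([expert(x) for expert in self.experts], dim=-2)  # (B, T, E, D)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = indices[..., slot]                                  # (B, T)
            w = weights[..., slot].unsqueeze(-1)                      # (B, T, 1)
            chosen = torch.gather(
                expert_outs, dim=-2,
                index=idx.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, 1, x.size(-1)),
            ).squeeze(-2)                                             # (B, T, D)
            out = out + w * chosen
        return out


if __name__ == "__main__":
    layer = TemporalGuidedMoE(dim=256)
    video_tokens = torch.randn(2, 16, 256)    # 2 clips, 16 frames each
    print(layer(video_tokens).shape)          # torch.Size([2, 16, 256])
```

Conditioning the gate on a pooled clip summary is one plausible way to let expert routing depend on a video's overall dynamics rather than on individual frames alone; the paper's actual gating and expert design should be taken from the released code at the repository above.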
Source Journal
CiteScore: 13.80
Self-citation rate: 27.40%
Publication volume: 660
Review time: 5 months
About the Journal: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.