Yixin Qin;Lei Zhao;Lianli Gao;Haonan Zhang;Pengpeng Zeng;Heng Tao Shen
{"title":"零镜头视频问答的时间引导混合专家","authors":"Yixin Qin;Lei Zhao;Lianli Gao;Haonan Zhang;Pengpeng Zeng;Heng Tao Shen","doi":"10.1109/TCSVT.2025.3556422","DOIUrl":null,"url":null,"abstract":"Video Question Answering (VideoQA) is a challenging task in the vision-language field. Due to the time-consuming and labor-intensive labeling process of the question-answer pairs, fully supervised methods are no longer suitable for the current increasing demand for data. This has led to the rise of zero-shot VideoQA, and some works propose to adapt large language models (LLMs) to assist zero-shot learning. Despite recent progress, the inadequacy of LLMs in comprehending temporal information in videos and the neglect of temporal differences, e.g., the different dynamic changes between scenes or objects, remain insufficiently addressed by existing attempts in zero-shot VideoQA. In light of these challenges, a novel Temporal-guided Mixture-of-Experts Network (T-MoENet) for zero-shot video question answering is proposed in this paper. Specifically, we apply a temporal module to imbue language models with the capacity to perceive temporal information. Then a temporal-guided mixture-of-experts module is proposed to further learn the temporal differences presented in different videos. It enables the model to effectively improve the capacity of generalization. Our proposed method achieves state-of-the-art performance on multiple zero-shot VideoQA benchmarks, notably improving accuracy by 5.6% on TGIF-FrameQA and 2.3% on MSRVTT-QA while remaining competitive with other methods in the fully supervised setting. The codes and models developed in this study will be made publicly available at <uri>https://github.com/qyx1121/T-MoENet</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9003-9016"},"PeriodicalIF":11.1000,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Temporal-Guided Mixture-of-Experts for Zero-Shot Video Question Answering\",\"authors\":\"Yixin Qin;Lei Zhao;Lianli Gao;Haonan Zhang;Pengpeng Zeng;Heng Tao Shen\",\"doi\":\"10.1109/TCSVT.2025.3556422\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video Question Answering (VideoQA) is a challenging task in the vision-language field. Due to the time-consuming and labor-intensive labeling process of the question-answer pairs, fully supervised methods are no longer suitable for the current increasing demand for data. This has led to the rise of zero-shot VideoQA, and some works propose to adapt large language models (LLMs) to assist zero-shot learning. Despite recent progress, the inadequacy of LLMs in comprehending temporal information in videos and the neglect of temporal differences, e.g., the different dynamic changes between scenes or objects, remain insufficiently addressed by existing attempts in zero-shot VideoQA. In light of these challenges, a novel Temporal-guided Mixture-of-Experts Network (T-MoENet) for zero-shot video question answering is proposed in this paper. Specifically, we apply a temporal module to imbue language models with the capacity to perceive temporal information. Then a temporal-guided mixture-of-experts module is proposed to further learn the temporal differences presented in different videos. It enables the model to effectively improve the capacity of generalization. 
Our proposed method achieves state-of-the-art performance on multiple zero-shot VideoQA benchmarks, notably improving accuracy by 5.6% on TGIF-FrameQA and 2.3% on MSRVTT-QA while remaining competitive with other methods in the fully supervised setting. The codes and models developed in this study will be made publicly available at <uri>https://github.com/qyx1121/T-MoENet</uri>.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 9\",\"pages\":\"9003-9016\"},\"PeriodicalIF\":11.1000,\"publicationDate\":\"2025-03-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10946169/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10946169/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Temporal-Guided Mixture-of-Experts for Zero-Shot Video Question Answering
Video Question Answering (VideoQA) is a challenging task in the vision-language field. Because labeling question-answer pairs is time-consuming and labor-intensive, fully supervised methods can no longer keep pace with the growing demand for data. This has driven the rise of zero-shot VideoQA, and several works propose adapting large language models (LLMs) to support zero-shot learning. Despite recent progress, existing attempts at zero-shot VideoQA still do not adequately address two issues: LLMs struggle to comprehend temporal information in videos, and temporal differences, e.g., the distinct dynamic changes between scenes or objects, are largely neglected. To address these challenges, this paper proposes a novel Temporal-guided Mixture-of-Experts Network (T-MoENet) for zero-shot video question answering. Specifically, we apply a temporal module to imbue language models with the capacity to perceive temporal information, and then propose a temporal-guided mixture-of-experts module to further learn the temporal differences present across videos, enabling the model to generalize more effectively. The proposed method achieves state-of-the-art performance on multiple zero-shot VideoQA benchmarks, notably improving accuracy by 5.6% on TGIF-FrameQA and 2.3% on MSRVTT-QA, while remaining competitive with other methods in the fully supervised setting. The code and models developed in this study will be made publicly available at https://github.com/qyx1121/T-MoENet.
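To make the core idea of temporal-guided expert routing concrete, the sketch below shows a minimal mixture-of-experts layer whose gate is conditioned on a temporal summary of the video features, so that routing can reflect how dynamic a clip is. This is an illustrative assumption-laden sketch, not the authors' T-MoENet implementation (see the linked repository for the actual code); the module names, dimensions, pooling choice, and gating scheme are all assumptions made for exposition.

```python
# Illustrative sketch of a temporal-guided mixture-of-experts layer in PyTorch.
# NOT the authors' T-MoENet; all design details here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalGuidedMoE(nn.Module):
    def __init__(self, dim: int = 512, num_experts: int = 4, hidden: int = 1024):
        super().__init__()
        # Each expert is a small feed-forward network over token features.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )
        # The gate sees each token together with a temporal summary of the clip,
        # so expert weighting can depend on the video's temporal dynamics.
        self.gate = nn.Linear(2 * dim, num_experts)

    def forward(self, tokens: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # tokens:      (batch, seq_len, dim)  language/multimodal token features
        # video_feats: (batch, frames, dim)   per-frame video features
        temporal_summary = video_feats.mean(dim=1, keepdim=True)            # (B, 1, D)
        temporal_summary = temporal_summary.expand(-1, tokens.size(1), -1)  # (B, L, D)
        gate_logits = self.gate(torch.cat([tokens, temporal_summary], dim=-1))
        weights = F.softmax(gate_logits, dim=-1)                            # (B, L, E)
        expert_outs = torch.stack([e(tokens) for e in self.experts], dim=-1)  # (B, L, D, E)
        # Combine expert outputs with the temporally guided gate weights.
        return torch.einsum("blde,ble->bld", expert_outs, weights)


if __name__ == "__main__":
    layer = TemporalGuidedMoE()
    out = layer(torch.randn(2, 16, 512), torch.randn(2, 8, 512))
    print(out.shape)  # torch.Size([2, 16, 512])
```

In this sketch the temporal summary is a simple mean over frames; the paper's temporal module is presumably richer, but the routing pattern shown here illustrates how temporal differences between videos can steer which experts are emphasized.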
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.