TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong
{"title":"TC-LaVA:重新思考从图像到视频的时空理解转换","authors":"Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong","doi":"arxiv-2409.03206","DOIUrl":null,"url":null,"abstract":"Multimodal Large Language Models (MLLMs) have significantly improved\nperformance across various image-language applications. Recently, there has\nbeen a growing interest in adapting image pre-trained MLLMs for video-related\ntasks. However, most efforts concentrate on enhancing the vision encoder and\nprojector components, while the core part, Large Language Models (LLMs),\nremains comparatively under-explored. In this paper, we propose two strategies\nto enhance the model's capability in video understanding tasks by improving\ninter-layer attention computation in LLMs. Specifically, the first approach\nfocuses on the enhancement of Rotary Position Embedding (RoPE) with\nTemporal-Aware Dual RoPE, which introduces temporal position information to\nstrengthen the MLLM's temporal modeling capabilities while preserving the\nrelative position relationships of both visual and text tokens. The second\napproach involves enhancing the Attention Mask with the Frame-wise Block Causal\nAttention Mask, a simple yet effective method that broadens visual token\ninteractions within and across video frames while maintaining the causal\ninference mechanism. Based on these proposed methods, we adapt LLaVA for video\nunderstanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our\nTC-LLaVA achieves new state-of-the-art performance across various video\nunderstanding benchmarks with only supervised fine-tuning (SFT) on\nvideo-related datasets.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"84 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations\",\"authors\":\"Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong\",\"doi\":\"arxiv-2409.03206\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal Large Language Models (MLLMs) have significantly improved\\nperformance across various image-language applications. Recently, there has\\nbeen a growing interest in adapting image pre-trained MLLMs for video-related\\ntasks. However, most efforts concentrate on enhancing the vision encoder and\\nprojector components, while the core part, Large Language Models (LLMs),\\nremains comparatively under-explored. In this paper, we propose two strategies\\nto enhance the model's capability in video understanding tasks by improving\\ninter-layer attention computation in LLMs. Specifically, the first approach\\nfocuses on the enhancement of Rotary Position Embedding (RoPE) with\\nTemporal-Aware Dual RoPE, which introduces temporal position information to\\nstrengthen the MLLM's temporal modeling capabilities while preserving the\\nrelative position relationships of both visual and text tokens. The second\\napproach involves enhancing the Attention Mask with the Frame-wise Block Causal\\nAttention Mask, a simple yet effective method that broadens visual token\\ninteractions within and across video frames while maintaining the causal\\ninference mechanism. Based on these proposed methods, we adapt LLaVA for video\\nunderstanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). 
Our\\nTC-LLaVA achieves new state-of-the-art performance across various video\\nunderstanding benchmarks with only supervised fine-tuning (SFT) on\\nvideo-related datasets.\",\"PeriodicalId\":501479,\"journal\":{\"name\":\"arXiv - CS - Artificial Intelligence\",\"volume\":\"84 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.03206\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.03206","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Multimodal Large Language Models (MLLMs) have significantly improved performance across various image-language applications. Recently, there has been a growing interest in adapting image pre-trained MLLMs for video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core part, Large Language Models (LLMs), remains comparatively under-explored. In this paper, we propose two strategies to enhance the model's capability in video understanding tasks by improving inter-layer attention computation in LLMs. Specifically, the first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM's temporal modeling capabilities while preserving the relative position relationships of both visual and text tokens. The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism. Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.
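The abstract describes Temporal-Aware Dual RoPE only at a high level. As a rough illustration of the idea, the sketch below (not the authors' implementation) rotates queries and keys by the sum of two rotary embeddings: one driven by the usual global token position and one driven by a per-token frame index, so that visual tokens carry explicit temporal information while the relative positions of visual and text tokens are preserved. The helper names `rope_angles`, `apply_rope`, and `dual_rope`, the scaling factor `alpha`, and the choice to give text tokens a shared frame index are all assumptions of this sketch, not details taken from the paper.

```python
import torch

def rope_angles(pos_ids: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    # Rotation angles of shape (..., seq, dim/2) for integer position ids.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return pos_ids[..., None].float() * inv_freq

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    # Rotate consecutive feature pairs of x (batch, seq, dim) by the given angles.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

def dual_rope(q, k, global_ids, temporal_ids, alpha: float = 1.0):
    # Hypothetical "dual" RoPE: a global-position rotation plus a temporal
    # (frame-index) rotation, applied jointly to queries and keys.
    dim = q.shape[-1]
    angles = rope_angles(global_ids, dim) + alpha * rope_angles(temporal_ids, dim)
    return apply_rope(q, angles), apply_rope(k, angles)

# Toy prompt: 2 frames x 4 visual tokens, then 3 text tokens (11 tokens total).
global_ids = torch.arange(11).unsqueeze(0)                        # ordinary positions 0..10
temporal_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2]])  # frame index; text tokens share one id (assumed)
q = torch.randn(1, 11, 64)
k = torch.randn(1, 11, 64)
q_rot, k_rot = dual_rope(q, k, global_ids, temporal_ids)
```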
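Similarly, the Frame-wise Block Causal Attention Mask can be pictured as a standard causal mask augmented with bidirectional attention among visual tokens of the same frame; earlier frames remain visible through the causal part, so autoregressive inference is unaffected. The helper below is a hypothetical sketch under that reading; the paper's exact within- and cross-frame rules may differ, and the convention of marking text tokens with a frame id of -1 is an assumption.

```python
import torch

def frame_block_causal_mask(frame_ids: torch.Tensor) -> torch.Tensor:
    # Boolean mask (True = may attend): causal everywhere, plus bidirectional
    # attention among visual tokens that belong to the same frame.
    # frame_ids[i] is the frame index of token i, or -1 for text tokens (assumed convention).
    seq_len = frame_ids.shape[0]
    idx = torch.arange(seq_len)
    causal = idx[:, None] >= idx[None, :]
    same_frame = (frame_ids[:, None] == frame_ids[None, :]) & (frame_ids[:, None] >= 0)
    return causal | same_frame

# Toy layout: 2 frames x 3 visual tokens followed by 2 text tokens.
frame_ids = torch.tensor([0, 0, 0, 1, 1, 1, -1, -1])
mask = frame_block_causal_mask(frame_ids)
# Convert to the additive form expected by most attention implementations.
additive = torch.zeros(mask.shape, dtype=torch.float32).masked_fill(~mask, float("-inf"))
```

In this reading, text tokens keep strictly causal attention, which is what lets the LLM's generation mechanism stay unchanged while visual tokens gain richer in-frame interactions.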