TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong
{"title":"TC-LaVA:重新思考从图像到视频的时空理解转换","authors":"Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong","doi":"arxiv-2409.03206","DOIUrl":null,"url":null,"abstract":"Multimodal Large Language Models (MLLMs) have significantly improved\nperformance across various image-language applications. Recently, there has\nbeen a growing interest in adapting image pre-trained MLLMs for video-related\ntasks. However, most efforts concentrate on enhancing the vision encoder and\nprojector components, while the core part, Large Language Models (LLMs),\nremains comparatively under-explored. In this paper, we propose two strategies\nto enhance the model's capability in video understanding tasks by improving\ninter-layer attention computation in LLMs. Specifically, the first approach\nfocuses on the enhancement of Rotary Position Embedding (RoPE) with\nTemporal-Aware Dual RoPE, which introduces temporal position information to\nstrengthen the MLLM's temporal modeling capabilities while preserving the\nrelative position relationships of both visual and text tokens. The second\napproach involves enhancing the Attention Mask with the Frame-wise Block Causal\nAttention Mask, a simple yet effective method that broadens visual token\ninteractions within and across video frames while maintaining the causal\ninference mechanism. Based on these proposed methods, we adapt LLaVA for video\nunderstanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our\nTC-LLaVA achieves new state-of-the-art performance across various video\nunderstanding benchmarks with only supervised fine-tuning (SFT) on\nvideo-related datasets.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"84 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations\",\"authors\":\"Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong\",\"doi\":\"arxiv-2409.03206\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal Large Language Models (MLLMs) have significantly improved\\nperformance across various image-language applications. Recently, there has\\nbeen a growing interest in adapting image pre-trained MLLMs for video-related\\ntasks. However, most efforts concentrate on enhancing the vision encoder and\\nprojector components, while the core part, Large Language Models (LLMs),\\nremains comparatively under-explored. In this paper, we propose two strategies\\nto enhance the model's capability in video understanding tasks by improving\\ninter-layer attention computation in LLMs. Specifically, the first approach\\nfocuses on the enhancement of Rotary Position Embedding (RoPE) with\\nTemporal-Aware Dual RoPE, which introduces temporal position information to\\nstrengthen the MLLM's temporal modeling capabilities while preserving the\\nrelative position relationships of both visual and text tokens. The second\\napproach involves enhancing the Attention Mask with the Frame-wise Block Causal\\nAttention Mask, a simple yet effective method that broadens visual token\\ninteractions within and across video frames while maintaining the causal\\ninference mechanism. Based on these proposed methods, we adapt LLaVA for video\\nunderstanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). 
Our\\nTC-LLaVA achieves new state-of-the-art performance across various video\\nunderstanding benchmarks with only supervised fine-tuning (SFT) on\\nvideo-related datasets.\",\"PeriodicalId\":501479,\"journal\":{\"name\":\"arXiv - CS - Artificial Intelligence\",\"volume\":\"84 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.03206\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.03206","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Multimodal Large Language Models (MLLMs) have significantly improved performance across various image-language applications. Recently, there has been a growing interest in adapting image pre-trained MLLMs for video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core part, Large Language Models (LLMs), remains comparatively under-explored. In this paper, we propose two strategies to enhance the model's capability in video understanding tasks by improving inter-layer attention computation in LLMs. Specifically, the first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM's temporal modeling capabilities while preserving the relative position relationships of both visual and text tokens. The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism. Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.
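The abstract describes Temporal-Aware Dual RoPE only at a high level. As a rough illustration of the idea, the sketch below (not the authors' implementation) rotates queries and keys by the sum of two rotary embeddings: one driven by the usual global token position and one driven by a per-token frame index, so that visual tokens carry explicit temporal information while the relative positions of visual and text tokens are preserved. The helper names `rope_angles`, `apply_rope`, and `dual_rope`, the scaling factor `alpha`, and the choice to give text tokens a shared frame index are all assumptions of this sketch, not details taken from the paper.

```python
import torch

def rope_angles(pos_ids: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    # Rotation angles of shape (..., seq, dim/2) for integer position ids.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return pos_ids[..., None].float() * inv_freq

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    # Rotate consecutive feature pairs of x (batch, seq, dim) by the given angles.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

def dual_rope(q, k, global_ids, temporal_ids, alpha: float = 1.0):
    # Hypothetical "dual" RoPE: a global-position rotation plus a temporal
    # (frame-index) rotation, applied jointly to queries and keys.
    dim = q.shape[-1]
    angles = rope_angles(global_ids, dim) + alpha * rope_angles(temporal_ids, dim)
    return apply_rope(q, angles), apply_rope(k, angles)

# Toy prompt: 2 frames x 4 visual tokens, then 3 text tokens (11 tokens total).
global_ids = torch.arange(11).unsqueeze(0)                        # ordinary positions 0..10
temporal_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2]])  # frame index; text tokens share one id (assumed)
q = torch.randn(1, 11, 64)
k = torch.randn(1, 11, 64)
q_rot, k_rot = dual_rope(q, k, global_ids, temporal_ids)
```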
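Similarly, the Frame-wise Block Causal Attention Mask can be pictured as a standard causal mask augmented with bidirectional attention among visual tokens of the same frame; earlier frames remain visible through the causal part, so autoregressive inference is unaffected. The helper below is a hypothetical sketch under that reading; the paper's exact within- and cross-frame rules may differ, and the convention of marking text tokens with a frame id of -1 is an assumption.

```python
import torch

def frame_block_causal_mask(frame_ids: torch.Tensor) -> torch.Tensor:
    # Boolean mask (True = may attend): causal everywhere, plus bidirectional
    # attention among visual tokens that belong to the same frame.
    # frame_ids[i] is the frame index of token i, or -1 for text tokens (assumed convention).
    seq_len = frame_ids.shape[0]
    idx = torch.arange(seq_len)
    causal = idx[:, None] >= idx[None, :]
    same_frame = (frame_ids[:, None] == frame_ids[None, :]) & (frame_ids[:, None] >= 0)
    return causal | same_frame

# Toy layout: 2 frames x 3 visual tokens followed by 2 text tokens.
frame_ids = torch.tensor([0, 0, 0, 1, 1, 1, -1, -1])
mask = frame_block_causal_mask(frame_ids)
# Convert to the additive form expected by most attention implementations.
additive = torch.zeros(mask.shape, dtype=torch.float32).masked_fill(~mask, float("-inf"))
```

In this reading, text tokens keep strictly causal attention, which is what lets the LLM's generation mechanism stay unchanged while visual tokens gain richer in-frame interactions.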