Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong
{"title":"TC-LaVA:重新思考从图像到视频的时空理解转换","authors":"Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong","doi":"arxiv-2409.03206","DOIUrl":null,"url":null,"abstract":"Multimodal Large Language Models (MLLMs) have significantly improved\nperformance across various image-language applications. Recently, there has\nbeen a growing interest in adapting image pre-trained MLLMs for video-related\ntasks. However, most efforts concentrate on enhancing the vision encoder and\nprojector components, while the core part, Large Language Models (LLMs),\nremains comparatively under-explored. In this paper, we propose two strategies\nto enhance the model's capability in video understanding tasks by improving\ninter-layer attention computation in LLMs. Specifically, the first approach\nfocuses on the enhancement of Rotary Position Embedding (RoPE) with\nTemporal-Aware Dual RoPE, which introduces temporal position information to\nstrengthen the MLLM's temporal modeling capabilities while preserving the\nrelative position relationships of both visual and text tokens. The second\napproach involves enhancing the Attention Mask with the Frame-wise Block Causal\nAttention Mask, a simple yet effective method that broadens visual token\ninteractions within and across video frames while maintaining the causal\ninference mechanism. Based on these proposed methods, we adapt LLaVA for video\nunderstanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our\nTC-LLaVA achieves new state-of-the-art performance across various video\nunderstanding benchmarks with only supervised fine-tuning (SFT) on\nvideo-related datasets.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"84 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations\",\"authors\":\"Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong\",\"doi\":\"arxiv-2409.03206\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal Large Language Models (MLLMs) have significantly improved\\nperformance across various image-language applications. Recently, there has\\nbeen a growing interest in adapting image pre-trained MLLMs for video-related\\ntasks. However, most efforts concentrate on enhancing the vision encoder and\\nprojector components, while the core part, Large Language Models (LLMs),\\nremains comparatively under-explored. In this paper, we propose two strategies\\nto enhance the model's capability in video understanding tasks by improving\\ninter-layer attention computation in LLMs. Specifically, the first approach\\nfocuses on the enhancement of Rotary Position Embedding (RoPE) with\\nTemporal-Aware Dual RoPE, which introduces temporal position information to\\nstrengthen the MLLM's temporal modeling capabilities while preserving the\\nrelative position relationships of both visual and text tokens. The second\\napproach involves enhancing the Attention Mask with the Frame-wise Block Causal\\nAttention Mask, a simple yet effective method that broadens visual token\\ninteractions within and across video frames while maintaining the causal\\ninference mechanism. Based on these proposed methods, we adapt LLaVA for video\\nunderstanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). 
Our\\nTC-LLaVA achieves new state-of-the-art performance across various video\\nunderstanding benchmarks with only supervised fine-tuning (SFT) on\\nvideo-related datasets.\",\"PeriodicalId\":501479,\"journal\":{\"name\":\"arXiv - CS - Artificial Intelligence\",\"volume\":\"84 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.03206\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.03206","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations
Multimodal Large Language Models (MLLMs) have significantly improved performance across a range of image-language applications. Recently, there has been growing interest in adapting image-pretrained MLLMs to video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core component, the Large Language Model (LLM) itself, remains comparatively under-explored. In this paper, we propose two strategies that improve the model's video understanding capability by improving inter-layer attention computation in the LLM. The first enhances Rotary Position Embedding (RoPE) with a Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM's temporal modeling while preserving the relative position relationships of both visual and text tokens.
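To make this first idea concrete, below is a minimal, hypothetical PyTorch sketch of a dual RoPE that rotates queries and keys with two sets of position ids: the usual sequential positions plus frame-level temporal positions. The helper names, the `alpha` weighting, and the choice to sum the two angle sets are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch, not the paper's exact formulation: a "dual" RoPE that
# rotates queries/keys with two sets of position ids -- the usual sequential
# position plus a frame-level temporal position. The helper names, the alpha
# scale, and the summed angles are illustrative assumptions.
import torch


def rope_angles(pos_ids: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles for integer positions; returns (..., seq, dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return pos_ids.float()[..., None] * inv_freq


def rotate(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive (even, odd) channel pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def apply_dual_rope(q, k, seq_ids, temporal_ids, alpha: float = 1.0):
    """
    q, k:         (batch, heads, seq, head_dim)
    seq_ids:      (batch, seq) ordinary token positions.
    temporal_ids: (batch, seq) frame index for visual tokens; text tokens could
                  share one id (an assumption) so their relative offsets survive.
    Summing the two angle sets composes the two rotations, so attention scores
    remain a function of relative differences in *both* kinds of position.
    """
    head_dim = q.shape[-1]
    angles = rope_angles(seq_ids, head_dim) + alpha * rope_angles(temporal_ids, head_dim)
    angles = angles[:, None, :, :]  # broadcast over the head dimension
    return rotate(q, angles), rotate(k, angles)
```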
The second approach enhances the attention mask with a Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism.
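Likewise, here is a rough sketch of what such a frame-wise block causal mask could look like, assuming visual tokens attend bidirectionally within their own frame while all other interactions stay causal; the `frame_ids` convention (-1 for text tokens) is made up for illustration.

```python
# Hypothetical sketch (one plausible reading of the abstract, not the paper's
# exact mask): start from a standard causal mask, then additionally let visual
# tokens inside the same frame attend to each other bidirectionally. Later
# positions never attend to future frames, so generation stays causal.
# The frame_ids layout (-1 for text, 0..T-1 per frame) is an assumption.
import torch


def frame_block_causal_mask(frame_ids: torch.Tensor) -> torch.Tensor:
    """frame_ids: (seq,) with -1 for text tokens and a frame index for visual
    tokens. Returns a (seq, seq) boolean mask, True where attention is allowed."""
    seq = frame_ids.shape[0]
    idx = torch.arange(seq)
    causal = idx[:, None] >= idx[None, :]                     # lower-triangular mask
    is_visual = frame_ids >= 0
    same_frame = (frame_ids[:, None] == frame_ids[None, :]) \
        & is_visual[:, None] & is_visual[None, :]             # within-frame visual block
    return causal | same_frame


# Usage: two 3-token frames followed by two text tokens.
mask = frame_block_causal_mask(torch.tensor([0, 0, 0, 1, 1, 1, -1, -1]))
attn_bias = torch.zeros(mask.shape).masked_fill(~mask, float("-inf"))  # additive attention bias
```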
Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.