Yi Liu, Haowen Hou, Fei Ma, Shiguang Ni, Fei Richard Yu
{"title":"MLLM-TA:利用多模态大语言模型进行精确的时间视频接地","authors":"Yi Liu;Haowen Hou;Fei Ma;Shiguang Ni;Fei Richard Yu","doi":"10.1109/LSP.2024.3511426","DOIUrl":null,"url":null,"abstract":"In untrimmed video tasks, identifying temporal boundaries in videos is crucial for temporal video grounding. With the emergence of multimodal large language models (MLLMs), recent studies have focused on endowing these models with the capability of temporal perception in untrimmed videos. To address the challenge, in this paper, we introduce a multimodal large language model named MLLM-TA with precise temporal perception to obtain temporal attention. Unlike the traditional MLLMs, answering temporal questions through one or two words related to temporal information, we leverage the text description proficiency of MLLMs to acquire video temporal attention with description. Specifically, we design a dual temporal-aware generative branches aimed at the visual space of the entire video and the textual space of global descriptions, simultaneously generating mutually supervised consistent temporal attention, thereby enhancing the video temporal perception capabilities of MLLMs. Finally, we evaluate our approach on both video grounding task and highlight detection task on three popular benchmarks, including Charades-STA, ActivityNet Captions and QVHighlights. The extensive results show that our MLLM-TA significantly outperforms previous approaches both on zero-shot and supervised setting, achieving state-of-the-art performance.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"281-285"},"PeriodicalIF":3.2000,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MLLM-TA: Leveraging Multimodal Large Language Models for Precise Temporal Video Grounding\",\"authors\":\"Yi Liu;Haowen Hou;Fei Ma;Shiguang Ni;Fei Richard Yu\",\"doi\":\"10.1109/LSP.2024.3511426\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In untrimmed video tasks, identifying temporal boundaries in videos is crucial for temporal video grounding. With the emergence of multimodal large language models (MLLMs), recent studies have focused on endowing these models with the capability of temporal perception in untrimmed videos. To address the challenge, in this paper, we introduce a multimodal large language model named MLLM-TA with precise temporal perception to obtain temporal attention. Unlike the traditional MLLMs, answering temporal questions through one or two words related to temporal information, we leverage the text description proficiency of MLLMs to acquire video temporal attention with description. Specifically, we design a dual temporal-aware generative branches aimed at the visual space of the entire video and the textual space of global descriptions, simultaneously generating mutually supervised consistent temporal attention, thereby enhancing the video temporal perception capabilities of MLLMs. Finally, we evaluate our approach on both video grounding task and highlight detection task on three popular benchmarks, including Charades-STA, ActivityNet Captions and QVHighlights. 
The extensive results show that our MLLM-TA significantly outperforms previous approaches both on zero-shot and supervised setting, achieving state-of-the-art performance.\",\"PeriodicalId\":13154,\"journal\":{\"name\":\"IEEE Signal Processing Letters\",\"volume\":\"32 \",\"pages\":\"281-285\"},\"PeriodicalIF\":3.2000,\"publicationDate\":\"2024-12-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Signal Processing Letters\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10777595/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Signal Processing Letters","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10777595/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
MLLM-TA: Leveraging Multimodal Large Language Models for Precise Temporal Video Grounding
In untrimmed video tasks, identifying temporal boundaries in videos is crucial for temporal video grounding. With the emergence of multimodal large language models (MLLMs), recent studies have focused on endowing these models with the capability of temporal perception in untrimmed videos. To address this challenge, we introduce MLLM-TA, a multimodal large language model with precise temporal perception for obtaining temporal attention. Unlike traditional MLLMs, which answer temporal questions with only one or two words related to temporal information, we leverage the text-description proficiency of MLLMs to acquire video temporal attention through description. Specifically, we design dual temporal-aware generative branches aimed at the visual space of the entire video and the textual space of global descriptions, which simultaneously generate mutually supervised, consistent temporal attention, thereby enhancing the video temporal perception capabilities of MLLMs. Finally, we evaluate our approach on both the video grounding and highlight detection tasks on three popular benchmarks: Charades-STA, ActivityNet Captions, and QVHighlights. Extensive results show that MLLM-TA significantly outperforms previous approaches in both zero-shot and supervised settings, achieving state-of-the-art performance.
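
To make the dual-branch idea concrete, the sketch below illustrates one plausible way two temporal-attention branches (one over visual features, one over features tied to a global textual description) could supervise each other via a symmetric consistency loss. All module names, dimensions, and the KL-based loss here are assumptions for illustration only; they are not the authors' actual MLLM-TA implementation.

# Minimal, hypothetical sketch of mutually supervised temporal attention
# between a visual branch and a textual (description) branch.
# Module names, feature dimensions, and the symmetric-KL consistency loss
# are assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionBranch(nn.Module):
    """Scores each clip feature and produces a temporal attention
    distribution over T time steps (one branch per feature space)."""
    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, feat_dim) -> attention: (B, T), softmax over time
        scores = self.scorer(feats).squeeze(-1)
        return F.softmax(scores, dim=-1)

def consistency_loss(attn_visual: torch.Tensor, attn_textual: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence so each branch supervises the other
    toward a consistent temporal attention distribution."""
    eps = 1e-8
    kl_vt = F.kl_div((attn_visual + eps).log(), attn_textual, reduction="batchmean")
    kl_tv = F.kl_div((attn_textual + eps).log(), attn_visual, reduction="batchmean")
    return 0.5 * (kl_vt + kl_tv)

if __name__ == "__main__":
    B, T, D = 2, 64, 512                 # batch, time steps, feature dim (assumed)
    visual_branch = TemporalAttentionBranch(D)
    textual_branch = TemporalAttentionBranch(D)
    visual_feats = torch.randn(B, T, D)  # per-clip visual features
    textual_feats = torch.randn(B, T, D) # per-clip features conditioned on the global description
    loss = consistency_loss(visual_branch(visual_feats), textual_branch(textual_feats))
    print(loss.item())

In such a setup, the consistency term would be added to the task losses so that peaks in the two attention distributions converge on the same temporal segment, which is the intuition behind mutual supervision as described in the abstract.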
Journal Introduction:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.