Surveillance Video-and-Language Understanding: From Small to Large Multimodal Models

Impact Factor: 8.3 · CAS Tier 1 (Engineering & Technology) · JCR Q1 (Engineering, Electrical & Electronic)
Tongtong Yuan, Xuange Zhang, Bo Liu, Kun Liu, Jian Jin, Zhenzhen Jiao
DOI: 10.1109/TCSVT.2024.3462433
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 300-314
Published: 2024-09-17 (IEEE Xplore: https://ieeexplore.ieee.org/document/10681489/)
Citations: 0

Abstract

Surveillance videos play a crucial role in public security. However, current tasks related to surveillance videos primarily focus on classifying and localizing anomalous events. Despite achieving notable performance, existing methods are restricted to detecting and classifying predefined events and lack satisfactory semantic understanding. To tackle this challenge, we introduce a novel research avenue focused on Video-and-Language Understanding for surveillance (VALU) and construct the first multimodal surveillance video dataset. We manually annotate the real-world surveillance dataset UCF-Crime with fine-grained event content and timing. Our newly annotated dataset, UCA (UCF-Crime Annotation), contains 23,542 sentences with an average length of 20 words, and its annotated videos total 110.7 hours. Moreover, we evaluate state-of-the-art (SOTA) models on five multimodal tasks using this newly created dataset, establishing new baselines for surveillance VALU, from small to large models. Our experiments reveal that mainstream models, which perform well on previous public datasets, perform poorly on surveillance video, highlighting new challenges in surveillance VALU. In addition to conducting baseline experiments to compare the performance of existing models, we propose novel methods for multimodal anomaly detection tasks and fine-tune multimodal large language models on our dataset. All experiments highlight the necessity of this multimodal dataset for advancing surveillance AI. Building on these experimental results, we conduct further in-depth analysis and discussion. The dataset and code are provided at https://xuange923.github.io/Surveillance-Video-Understanding.
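The abstract reports corpus-level statistics for UCA (23,542 sentences, 20 words on average, 110.7 annotated hours). As a minimal sketch of how such statistics could be computed, the snippet below assumes a hypothetical annotation file format: a JSON object mapping each video ID to a list of timed caption entries. The actual UCA schema may differ; see the project page for the real files.

```python
# Hypothetical sketch: computing UCA-style dataset statistics.
# ASSUMED format (not confirmed by the paper): a JSON object mapping each
# video ID to a list of {"start": sec, "end": sec, "caption": str} entries.
import json

def dataset_stats(path):
    with open(path) as f:
        annotations = json.load(f)
    sentences = [e["caption"]
                 for events in annotations.values() for e in events]
    total_words = sum(len(s.split()) for s in sentences)
    # Total annotated duration, summed over event spans, in hours.
    hours = sum(e["end"] - e["start"]
                for events in annotations.values() for e in events) / 3600
    return {
        "num_sentences": len(sentences),
        "avg_words": total_words / max(len(sentences), 1),
        "annotated_hours": round(hours, 1),
    }
```

Run against the real annotation files, such a script would reproduce (or sanity-check) the sentence count, average caption length, and total annotated hours quoted above.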
Source journal metrics:
CiteScore: 13.80
Self-citation rate: 27.40%
Articles per year: 660
Review time: 5 months
Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. It encourages submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display, as well as contributions in processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; and storage, retrieval, indexing, and search. Papers focusing on hardware and software design and implementation are also highly valued.