{"title":"Surveillance Video-and-Language Understanding: From Small to Large Multimodal Models","authors":"Tongtong Yuan;Xuange Zhang;Bo Liu;Kun Liu;Jian Jin;Zhenzhen Jiao","doi":"10.1109/TCSVT.2024.3462433","DOIUrl":null,"url":null,"abstract":"Surveillance videos play a crucial role in public security. However, current tasks related to surveillance videos primarily focus on classifying and localizing anomalous events. Despite achieving notable performance, existing methods are restricted to detecting and classifying predefined events and lack satisfactory semantic understanding. To tackle this challenge, we introduce a novel research avenue focused on Video-and-Language Understanding for surveillance (VALU), and construct the first multimodal surveillance video dataset. We manually annotate the real-world surveillance dataset UCF-Crime with fine-grained event content and timing. Our newly annotated dataset, UCA (UCF-Crime Annotation), contains 23,542 sentences, with an average length of 20 words, and its annotated videos are as long as 110.7 hours. Moreover, we evaluate SOTA models on five multimodal tasks using this newly created dataset, establishing new baselines for surveillance VALU, from small to large models. Our experiments reveal that mainstream models, which perform well on previously public datasets, exhibit poor performance on surveillance video, highlighting new challenges in surveillance VALU. In addition to conducting baseline experiments to compare the performance of existing models, we also propose novel methods for multimodal anomaly detection tasks and finetune multimodal large language model models using our dataset. All the experiments highlight the necessity of constructing this multimodal dataset to advance surveillance AI. Upon the experimental results mentioned above, we conduct further in-depth analysis and discussion. The dataset and codes are provided at <uri>https://xuange923.github.io/Surveillance-Video-Understanding</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"300-314"},"PeriodicalIF":8.3000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10681489/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Surveillance videos play a crucial role in public security. However, current tasks related to surveillance videos primarily focus on classifying and localizing anomalous events. Despite achieving notable performance, existing methods are restricted to detecting and classifying predefined events and lack satisfactory semantic understanding. To tackle this challenge, we introduce a novel research avenue focused on Video-and-Language Understanding for surveillance (VALU), and construct the first multimodal surveillance video dataset. We manually annotate the real-world surveillance dataset UCF-Crime with fine-grained event content and timing. Our newly annotated dataset, UCA (UCF-Crime Annotation), contains 23,542 sentences with an average length of 20 words, and its annotated videos total 110.7 hours. Moreover, we evaluate state-of-the-art (SOTA) models on five multimodal tasks using this newly created dataset, establishing new baselines for surveillance VALU across both small and large models. Our experiments reveal that mainstream models, which perform well on existing public datasets, perform poorly on surveillance videos, highlighting new challenges in surveillance VALU. In addition to conducting baseline experiments to compare the performance of existing models, we propose novel methods for multimodal anomaly detection tasks and fine-tune multimodal large language models on our dataset. All the experiments highlight the necessity of constructing this multimodal dataset to advance surveillance AI. Based on the experimental results above, we conduct further in-depth analysis and discussion. The dataset and code are provided at https://xuange923.github.io/Surveillance-Video-Understanding.
About the journal:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.