Surveillance Video-and-Language Understanding: From Small to Large Multimodal Models

Impact Factor: 8.3 · CAS Tier 1 (Engineering & Technology) · JCR Q1 (Engineering, Electrical & Electronic)
Tongtong Yuan, Xuange Zhang, Bo Liu, Kun Liu, Jian Jin, Zhenzhen Jiao
DOI: 10.1109/TCSVT.2024.3462433
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 300-314
Published: 2024-09-17 (IEEE Xplore: https://ieeexplore.ieee.org/document/10681489/)
Citations: 0

Abstract

Surveillance videos play a crucial role in public security. However, current tasks related to surveillance videos primarily focus on classifying and localizing anomalous events. Despite achieving notable performance, existing methods are restricted to detecting and classifying predefined events and lack satisfactory semantic understanding. To tackle this challenge, we introduce a novel research avenue focused on Video-and-Language Understanding for surveillance (VALU) and construct the first multimodal surveillance video dataset. We manually annotate the real-world surveillance dataset UCF-Crime with fine-grained event content and timing. Our newly annotated dataset, UCA (UCF-Crime Annotation), contains 23,542 sentences with an average length of 20 words, and its annotated videos total 110.7 hours. Moreover, we evaluate state-of-the-art (SOTA) models on five multimodal tasks using this newly created dataset, establishing new baselines for surveillance VALU, from small to large models. Our experiments reveal that mainstream models, which perform well on previous public datasets, perform poorly on surveillance video, highlighting new challenges in surveillance VALU. In addition to conducting baseline experiments to compare the performance of existing models, we propose novel methods for multimodal anomaly detection tasks and fine-tune multimodal large language models on our dataset. All experiments highlight the necessity of this multimodal dataset for advancing surveillance AI. Building on these experimental results, we conduct further in-depth analysis and discussion. The dataset and code are provided at https://xuange923.github.io/Surveillance-Video-Understanding.
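The abstract reports corpus-level statistics for UCA (23,542 sentences, 20 words on average, 110.7 annotated hours). As a minimal sketch of how such statistics could be computed, the snippet below assumes a hypothetical annotation file format: a JSON object mapping each video ID to a list of timed caption entries. The actual UCA schema may differ; see the project page for the real files.

```python
# Hypothetical sketch: computing UCA-style dataset statistics.
# ASSUMED format (not confirmed by the paper): a JSON object mapping each
# video ID to a list of {"start": sec, "end": sec, "caption": str} entries.
import json

def dataset_stats(path):
    with open(path) as f:
        annotations = json.load(f)
    sentences = [e["caption"]
                 for events in annotations.values() for e in events]
    total_words = sum(len(s.split()) for s in sentences)
    # Total annotated duration, summed over event spans, in hours.
    hours = sum(e["end"] - e["start"]
                for events in annotations.values() for e in events) / 3600
    return {
        "num_sentences": len(sentences),
        "avg_words": total_words / max(len(sentences), 1),
        "annotated_hours": round(hours, 1),
    }
```

Run against the real annotation files, such a script would reproduce (or sanity-check) the sentence count, average caption length, and total annotated hours quoted above.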
Source journal metrics:
CiteScore: 13.80
Self-citation rate: 27.40%
Articles per year: 660
Review time: 5 months
Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. It encourages submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display, as well as contributions in processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; and storage, retrieval, indexing, and search. Papers focusing on hardware and software design and implementation are also highly valued.