Human-Centric Behavior Description in Videos: New Benchmark and Model

IF 8.4 | Zone 1: Computer Science | Q1: COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia | Pub Date: 2024-07-02 | DOI: 10.1109/TMM.2024.3414263

Lingru Zhou; Yiqi Gao; Manqing Zhang; Peng Wu; Peng Wang; Yanning Zhang

Volume 26, pages 10867-10878 | Journal Article | https://ieeexplore.ieee.org/document/10582309/

Citations: 0

Abstract

In video surveillance, describing the behavior of each individual in a video is becoming increasingly essential, especially in complex scenes where multiple individuals are present. Per-individual descriptions provide more detailed situational analysis, enabling accurate assessment of and response to potential risks and helping to ensure the safety and harmony of public places. Existing video-level captioning datasets, however, cannot provide fine-grained descriptions of each individual's specific behavior. Descriptions at the video level alone fail to give an in-depth interpretation of individual behaviors, making it challenging to accurately determine the specific identity of each individual. To address this challenge, we construct a human-centric video surveillance captioning dataset that provides detailed descriptions of the dynamic behaviors of 7,820 individuals distributed across 1,012 videos. Specifically, we label several aspects of each person, such as location, clothing, and interactions with other elements in the scene. With this dataset, individuals can be linked to their respective behaviors, allowing further analysis of each person's behavior in surveillance videos. Besides the dataset, we propose a novel video captioning approach that describes individual behavior in detail at the person level, achieving state-of-the-art results.
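The abstract names three labeled aspects per person (location, clothing, interactions with other scene elements) and links each individual to a behavior description. A minimal sketch of how such person-level annotations might be represented follows. Every class name, field name, and the example record are assumptions made for illustration; the abstract does not specify the dataset's actual file format or schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PersonAnnotation:
    """One annotated individual. All field names here are hypothetical."""
    person_id: int
    location: str      # where the person is in the scene
    clothing: str      # appearance description
    interactions: str  # interactions with other elements in the scene
    caption: str       # free-text description of the person's behavior


@dataclass
class VideoAnnotation:
    """One surveillance video with its person-level annotations."""
    video_id: str
    people: List[PersonAnnotation] = field(default_factory=list)

    def caption_for(self, person_id: int) -> str:
        """Return the behavior description linked to one individual."""
        for person in self.people:
            if person.person_id == person_id:
                return person.caption
        raise KeyError(f"no person {person_id} in video {self.video_id}")


def mean_people_per_video(num_people: int, num_videos: int) -> float:
    """Average number of annotated individuals per video."""
    return num_people / num_videos


# Invented example record, for illustration only (not taken from the dataset).
example = VideoAnnotation(
    video_id="video_0001",
    people=[
        PersonAnnotation(
            person_id=1,
            location="near the entrance",
            clothing="dark jacket",
            interactions="carries a bag",
            caption="A person in a dark jacket walks in carrying a bag.",
        )
    ],
)

print(example.caption_for(1))
# Counts reported in the abstract: 7,820 individuals across 1,012 videos.
print(round(mean_people_per_video(7820, 1012), 2))
```

Keying captions by a per-video person identifier is one simple way to express the link between individuals and their behaviors that the abstract describes. With the reported counts, the average works out to roughly 7.7 annotated individuals per video.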


Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

About the journal: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.