Complex Event Recognition via Spatial-Temporal Relation Graph Reasoning

Hua Lin, Hongtian Zhao, Hua Yang
DOI: 10.1109/VCIP53242.2021.9675337
Venue: 2021 International Conference on Visual Communications and Image Processing (VCIP)
Published: 2021-12-05
Citations: 0

Abstract

Events in videos usually involve a variety of factors: objects, environments, actions, and their interaction relations. These factors serve as mid-level semantics that can bridge the gap between event categories and video clips. In this paper, we present a novel video event recognition method that uses graph convolutional networks to represent and reason about the logical relations among these factors. Because different kinds of events may focus on different factors, we use transformer networks to extract spatio-temporal features, drawing on the attention mechanism to adaptively assign weights to the key factors of concern. Although transformers generally require large datasets, we show the effectiveness of applying a 2D convolutional backbone before the transformers. We train and test our framework on the challenging video event recognition dataset UCF-Crime and conduct ablation studies. The experimental results show that our method achieves state-of-the-art performance, outperforming previous leading models by a significant margin in recognition accuracy.
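The abstract describes a pipeline of attention-based spatio-temporal feature extraction followed by graph-convolutional reasoning over factor nodes. The following is a minimal numpy sketch of those two core operations, not the authors' implementation: all shapes, weight initializations, and the random "backbone" features are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over T frame features of shape (T, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # pairwise frame affinities
    return softmax(scores, axis=-1) @ V            # attention-weighted features

def gcn_layer(H, A, W):
    """One graph-convolution step over factor nodes: ReLU(D^-1 (A + I) H W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))       # row-normalize the adjacency
    return np.maximum(D_inv @ A_hat @ H @ W, 0.0)

rng = np.random.default_rng(0)
T, d = 8, 16                 # frames and feature dim; stands in for 2D-CNN output
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)                  # temporal features, (T, d)

N = 5                        # factor nodes: objects, environment, actions, ...
A = (rng.random((N, N)) > 0.5).astype(float)
A = np.maximum(A, A.T)                             # symmetric relation graph
H = rng.normal(size=(N, d))
H1 = gcn_layer(H, A, rng.normal(size=(d, d)) * 0.1)
print(Z.shape, H1.shape)                           # (8, 16) (5, 16)
```

In the paper's full model, the attention output would come from transformer layers stacked on CNN frame features, and the node features and adjacency would be derived from detected objects and their interactions rather than sampled at random.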