A Transformer-based Late-Fusion Mechanism for Fine-Grained Object Recognition in Videos

2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW). Pub Date: 2023-01-01. DOI: 10.1109/WACVW58289.2023.00015

Jannik Koch, Stefan Wolf, Jürgen Beyerer


Citations: 0

Abstract

Fine-grained image classification is limited by only considering a single view while in many cases, like surveillance, a whole video exists which provides multiple perspectives. However, the potential of videos is mostly considered in the context of action recognition while fine-grained object recognition is rarely considered as an application for video classification. This leads to recent video classification architectures being inappropriate for the task of fine-grained object recognition. We propose a novel, Transformer-based late-fusion mechanism for fine-grained video classification. Our approach achieves superior results to both early-fusion mechanisms, like the Video Swin Transformer, and a simple consensus-based late-fusion baseline with a modern Swin Transformer backbone. Additionally, we achieve improved efficiency, as our results show a high increase in accuracy with only a slight increase in computational complexity. Code is available at: https://github.com/wolfstefan/tlf.
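To make the contrast in the abstract concrete, the following is a toy sketch, not the authors' implementation (which is in the linked repository). It assumes that late fusion means each frame is first processed independently by a backbone, and that fusion happens afterwards. The consensus baseline is illustrated here as averaging per-frame class probabilities, and the Transformer-style alternative as a single self-attention step across per-frame feature vectors followed by mean pooling. All function names, the untrained identity projections, and the example numbers are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def consensus_fusion(frame_logits):
    """Consensus-style late fusion (assumed form): average per-frame class probabilities."""
    probs = [softmax(logits) for logits in frame_logits]
    num_frames = len(probs)
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / num_frames for c in range(num_classes)]


def attention_fusion(frame_features):
    """Toy attention-based late fusion over per-frame feature vectors.

    Uses one self-attention step with identity query/key/value projections
    (an illustrative simplification; a real model learns these), then
    mean-pools the attended frame features into one video-level vector.
    """
    dim = len(frame_features[0])
    attended = []
    for query in frame_features:
        # Scaled dot-product similarity between this frame and every frame.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frame_features
        ]
        weights = softmax(scores)
        # Weighted sum of all frame features for this query frame.
        attended.append(
            [sum(w * f[i] for w, f in zip(weights, frame_features)) for i in range(dim)]
        )
    num_frames = len(attended)
    return [sum(a[i] for a in attended) / num_frames for i in range(dim)]


# Hypothetical per-frame outputs for a three-frame clip and two classes.
frame_logits = [[2.0, 0.5], [0.1, 0.3], [1.5, 0.2]]
video_probs = consensus_fusion(frame_logits)

# Hypothetical per-frame feature vectors from a backbone.
frame_features = [[1.0, 0.0, 0.5], [0.8, 0.1, 0.4], [0.0, 1.0, 0.2]]
video_feature = attention_fusion(frame_features)
```

In the consensus variant the frames never interact, whereas in the attention variant each frame's representation is reweighted by its similarity to the other frames before pooling. In a real model a classification head would follow the pooled feature.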
