A Transformer-based Late-Fusion Mechanism for Fine-Grained Object Recognition in Videos

2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW). Pub Date: 2023-01-01. DOI: 10.1109/WACVW58289.2023.00015

Jannik Koch, Stefan Wolf, Jürgen Beyerer

Citations: 0

Abstract

Fine-grained image classification is limited by considering only a single view, although in many cases, such as surveillance, a whole video is available that provides multiple perspectives. However, the potential of videos is mostly explored in the context of action recognition, while fine-grained object recognition is rarely considered as an application of video classification. As a result, recent video classification architectures are ill-suited to fine-grained object recognition. We propose a novel Transformer-based late-fusion mechanism for fine-grained video classification. Our approach achieves superior results to both early-fusion mechanisms, such as the Video Swin Transformer, and a simple consensus-based late-fusion baseline with a modern Swin Transformer backbone. Additionally, we achieve improved efficiency: our results show a large increase in accuracy with only a slight increase in computational complexity. Code is available at: https://github.com/wolfstefan/tlf.
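To make the baseline mentioned in the abstract concrete, the following is a minimal sketch of consensus-based late fusion: each frame is classified independently and the per-frame predictions are then combined. It assumes the consensus is a simple average of per-frame softmax probabilities, which the abstract does not specify. The function names and the example logits are hypothetical and are not taken from the authors' repository. The proposed method, as described above, replaces this fixed consensus with a Transformer-based fusion of the per-frame results.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consensus_fusion(frame_logits):
    """Average per-frame class probabilities (assumed consensus function).

    frame_logits: one list of class logits per video frame, e.g. the
    output of an image classifier applied to each frame independently.
    Returns (predicted_class_index, fused_probabilities).
    """
    if not frame_logits:
        raise ValueError("need at least one frame")
    probs = [softmax(logits) for logits in frame_logits]
    num_classes = len(probs[0])
    fused = [
        sum(p[c] for p in probs) / len(probs)
        for c in range(num_classes)
    ]
    predicted = max(range(num_classes), key=fused.__getitem__)
    return predicted, fused


# Hypothetical logits for three frames and three classes.
frame_logits = [
    [2.0, 1.0, 0.1],  # frame 1 favors class 0
    [0.5, 2.5, 0.2],  # frame 2 favors class 1
    [0.3, 2.2, 0.1],  # frame 3 favors class 1
]
predicted, fused = consensus_fusion(frame_logits)
print(predicted, [round(p, 3) for p in fused])
```

In this sketch every frame contributes equally to the final decision, so a single uninformative view can only be outvoted, never down-weighted. A learned fusion module can instead weight frames by how informative they are.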

