用单个视觉转换器联合学习图像和视频

2023 18th International Conference on Machine Vision and Applications (MVA) Pub Date : 2023-07-23 DOI:10.23919/MVA57639.2023.10215661

Shuki Shimizu, Toru Tamaki

{"title":"用单个视觉转换器联合学习图像和视频","authors":"Shuki Shimizu, Toru Tamaki","doi":"10.23919/MVA57639.2023.10215661","DOIUrl":null,"url":null,"abstract":"In this study, we propose a method for jointly learning of images and videos using a single model. In general, images and videos are often trained by separate models. We propose in this paper a method that takes a batch of images as input to Vision Transformer (IV-ViT), and also a set of video frames with temporal aggregation by late fusion. Experimental results on two image datasets and two action recognition datasets are presented.","PeriodicalId":338734,"journal":{"name":"2023 18th International Conference on Machine Vision and Applications (MVA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Joint learning of images and videos with a single Vision Transformer\",\"authors\":\"Shuki Shimizu, Toru Tamaki\",\"doi\":\"10.23919/MVA57639.2023.10215661\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this study, we propose a method for jointly learning of images and videos using a single model. In general, images and videos are often trained by separate models. We propose in this paper a method that takes a batch of images as input to Vision Transformer (IV-ViT), and also a set of video frames with temporal aggregation by late fusion. Experimental results on two image datasets and two action recognition datasets are presented.\",\"PeriodicalId\":338734,\"journal\":{\"name\":\"2023 18th International Conference on Machine Vision and Applications (MVA)\",\"volume\":\"6 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 18th International Conference on Machine Vision and Applications (MVA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.23919/MVA57639.2023.10215661\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 18th International Conference on Machine Vision and Applications (MVA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/MVA57639.2023.10215661","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

在本研究中，我们提出了一种使用单一模型对图像和视频进行联合学习的方法。一般来说，图像和视频通常由不同的模型进行训练。本文提出了一种将一批图像作为视觉转换器(IV-ViT)的输入，并通过后期融合得到一组具有时间聚合的视频帧的方法。给出了在两个图像数据集和两个动作识别数据集上的实验结果。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Joint learning of images and videos with a single Vision Transformer

In this study, we propose a method for jointly learning of images and videos using a single model. In general, images and videos are often trained by separate models. We propose in this paper a method that takes a batch of images as input to Vision Transformer (IV-ViT), and also a set of video frames with temporal aggregation by late fusion. Experimental results on two image datasets and two action recognition datasets are presented.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2023 18th International Conference on Machine Vision and Applications (MVA)

自引率

0.00%

发文量