Yaozong Gan, Ren Togo, Takahiro Ogawa, M. Haseyama
{"title":"Transformer Based Multimodal Scene Recognition in Soccer Videos","authors":"Yaozong Gan, Ren Togo, Takahiro Ogawa, M. Haseyama","doi":"10.1109/ICMEW56448.2022.9859304","DOIUrl":null,"url":null,"abstract":"This paper presents a transformer-based multimodal soccer scene recognition method for both visual and audio modalities. Our approach directly uses the original video frames and audio spectrogram from the soccer video as the input of the transformer model, which can capture the spatial information of the action at a moment and the contextual temporal information between different actions in the soccer videos. We fuse both video frames and audio spectrogram information output from the transformer model in order to better identify scenes that occur in real soccer matches. The late fusion performs a weighted average of visual and audio estimation results to obtain complete information of a soccer scene. We evaluate the proposed method on SoccerNet-V2 dataset and confirm that our method achieves the best performance compared with the recent and state-of-the-art methods.","PeriodicalId":106759,"journal":{"name":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMEW56448.2022.9859304","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
This paper presents a transformer-based multimodal soccer scene recognition method that uses both the visual and audio modalities. Our approach feeds the original video frames and the audio spectrogram from the soccer video directly into the transformer model, which can capture both the spatial information of an action at a given moment and the contextual temporal information between different actions in the video. We fuse the video-frame and audio-spectrogram information output by the transformer model to better identify scenes that occur in real soccer matches. The late fusion performs a weighted average of the visual and audio estimation results to obtain complete information about a soccer scene. We evaluate the proposed method on the SoccerNet-V2 dataset and confirm that it achieves the best performance compared with recent state-of-the-art methods.
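The late-fusion step described above can be sketched as a weighted average of per-class probabilities from the two modalities. This is a minimal illustration, not the paper's implementation: the function name, the class probabilities, and the weight value `w_visual` are all hypothetical, since the abstract does not state the weighting used.

```python
def late_fusion(p_visual, p_audio, w_visual=0.6):
    """Fuse per-class probability estimates from the visual and audio
    branches by weighted averaging, then return the predicted class.
    The weight 0.6 is an illustrative assumption, not from the paper."""
    fused = [w_visual * v + (1.0 - w_visual) * a
             for v, a in zip(p_visual, p_audio)]
    return fused.index(max(fused)), fused

# Example: three scene classes; audio is more confident about class 2.
pred, fused = late_fusion([0.2, 0.5, 0.3], [0.1, 0.2, 0.7])
```

With the illustrative inputs above, the fused distribution remains a valid probability vector, and the argmax over it gives the final scene label.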