Melissa Sanabria, Sherly, F. Precioso, Thomas Menguy
A Deep Architecture for Multimodal Summarization of Soccer Games
Published in MMSports '19, 2019-10-15
DOI: https://doi.org/10.1145/3347318.3355524
Citations: 27
Abstract
The massive growth of sports videos, especially in soccer, has created a need for automatically generated summaries, whose objective is not only to show the most important actions of the match but also to elicit as much emotion as summaries produced by human editors. State-of-the-art methods for video summarization mostly rely on video processing; however, this is not an optimal approach for long videos such as soccer matches. In this paper we propose a multimodal approach to automatically generate summaries of soccer match videos that considers both event and audio features. The event features provide a shorter and better representation of the match, and the audio helps detect the excitement generated by the game. Our method consists of three consecutive stages: Proposals, Summarization and Content Refinement. The first stage generates summary proposals, using Multiple Instance Learning to deal with the similarity between the events inside the summary and the rest of the match. The Summarization stage uses event and audio features as input to a hierarchical Recurrent Neural Network to decide which proposals should indeed be in the summary. The last stage takes advantage of the visual content to create the final summary. The results show that our approach outperforms by a large margin not only the video processing methods but also methods that use event and audio features.
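The three-stage pipeline described above can be sketched in simplified form. This is only an illustrative toy, not the authors' implementation: the event records, scoring rules, the MIL-style max-over-instances pooling in the Proposals stage, and the fixed-weight event/audio fusion standing in for the hierarchical RNN of the Summarization stage are all assumptions made for the sake of the example.

```python
def propose(events, window=3, top_k=2):
    """Stage 1 (Proposals): group events into candidate windows and keep
    the top-k. Following the Multiple Instance Learning intuition, a
    window (bag) is scored by its best event (instance) via a max."""
    windows = [events[i:i + window] for i in range(0, len(events), window)]
    scored = [(max(e["score"] for e in w), w) for w in windows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [w for _, w in scored[:top_k]]


def summarize(proposals, audio_excitement, threshold=0.5):
    """Stage 2 (Summarization): fuse event and audio evidence to decide
    which proposals enter the summary. A fixed 50/50 weighting is a
    crude stand-in for the paper's hierarchical RNN."""
    selected = []
    for proposal, excitement in zip(proposals, audio_excitement):
        event_score = max(e["score"] for e in proposal)
        if 0.5 * event_score + 0.5 * excitement > threshold:
            selected.append(proposal)
    return selected


# Toy match: each event carries an illustrative importance score.
events = [{"score": s} for s in (0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.2)]
proposals = propose(events)                 # two candidate windows survive
summary = summarize(proposals, [0.7, 0.2])  # only the exciting one is kept
```

Stage 3 (Content Refinement) is omitted here, since it operates on the raw visual content to trim the final clips.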