SumBot: Summarize Videos Like a Human

Hongxiang Gu, Stefano Petrangeli, Viswanathan Swaminathan
DOI: 10.1109/ISM.2020.00044
Published in: 2020 IEEE International Symposium on Multimedia (ISM), December 2020
Citations: 1

Abstract

Video currently accounts for 70% of all internet traffic, and this number is expected to continue to grow. Each minute, more than 500 hours' worth of video is uploaded to YouTube. Generating engaging short videos out of raw captured content is often a time-consuming and cumbersome activity for content creators. Existing ML-based video summarization and highlight-generation approaches often neglect the fact that many summarization tasks require specific domain knowledge of the video content, and that human editors often follow a semi-structured template when creating a summary (e.g., the highlights of a sporting event). In this paper, we therefore address the challenge of creating domain-specific summaries by actively leveraging this editorial template. In particular, we present an Inverse Reinforcement Learning (IRL)-based framework that can automatically learn the hidden structure or template followed by a human expert when generating a video summary for a specific domain. We formulate the video summarization task as a Markov Decision Process, where each state is a combination of the features of the video shots added to the summary, and the possible actions are to include a shot in the summary, remove one from it, or leave the summary as is. Using a set of domain-specific, human-generated video highlights as examples, we employ a Maximum Entropy IRL algorithm to learn the implicit reward function governing the summary-generation process. The learned reward function is then used to train an RL agent that produces video summaries for a specific domain, closely resembling what a human expert would create. Learning from expert demonstrations allows our approach to be applied to any domain or editorial style. To demonstrate the superior performance of our approach, we apply it to the task of soccer-game highlight generation and show that it outperforms other state-of-the-art methods, both quantitatively and qualitatively.
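The MDP and MaxEnt-IRL formulation summarized above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class and function names, the mean-pooling used to combine shot features into a state, and the linear reward model are all assumptions made for illustration.

```python
import numpy as np

INCLUDE, REMOVE, KEEP = 0, 1, 2  # the three actions named in the abstract

class SummaryMDP:
    """Toy environment: a video is a list of per-shot feature vectors."""

    def __init__(self, shot_features):
        self.shots = np.asarray(shot_features, dtype=float)
        self.selected = np.zeros(len(self.shots), dtype=bool)
        self.cursor = 0  # index of the shot currently under consideration

    def state(self):
        # State = combination of the features of the shots currently in the
        # summary; here simply their mean (an illustrative choice).
        if not self.selected.any():
            return np.zeros(self.shots.shape[1])
        return self.shots[self.selected].mean(axis=0)

    def step(self, action):
        if action == INCLUDE:
            self.selected[self.cursor] = True
        elif action == REMOVE:
            self.selected[self.cursor] = False
        # KEEP ("leave it as is") changes nothing.
        self.cursor = (self.cursor + 1) % len(self.shots)
        return self.state()

def maxent_irl_reward(expert_states, learner_states, lr=0.1, iters=100):
    """Learn a linear reward r(s) = w . s in MaxEnt-IRL style: the gradient
    of the demonstration log-likelihood is the gap between expert and
    learner feature expectations. (A full implementation would re-sample
    learner_states under the current reward at every iteration.)"""
    mu_expert = np.mean(np.asarray(expert_states, dtype=float), axis=0)
    w = np.zeros_like(mu_expert)
    for _ in range(iters):
        mu_learner = np.mean(np.asarray(learner_states, dtype=float), axis=0)
        w += lr * (mu_expert - mu_learner)  # simplified gradient ascent step
    return w
```

Under this sketch, the reward learned from expert summaries weights state features toward what the expert's selections exhibit; an RL agent trained against it would then favor include/remove decisions that reproduce the expert's editorial template.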