{"title":"SumBot:像人类一样总结视频","authors":"Hongxiang Gu, Stefano Petrangeli, Viswanathan Swaminathan","doi":"10.1109/ISM.2020.00044","DOIUrl":null,"url":null,"abstract":"Video currently accounts for 70% of all internet traffic and this number is expected to continue to grow. Each minute, more than 500 hours worth of videos are uploaded on YouTube. Generating engaging short videos out of the raw captured content is often a time-consuming and cumbersome activity for content creators. Existing ML- based video summarization and highlight generation approaches often neglect the fact that many summarization tasks require specific domain knowledge of the video content, and that human editors often follow a semistructured template when creating the summary (e.g. to create the highlights for a sport event). We therefore address in this paper the challenge of creating domain-specific summaries, by actively leveraging this editorial template. Particularly, we present an Inverse Reinforcement Learning (IRL)-based framework that can automatically learn the hidden structure or template followed by a human expert when generating a video summary for a specific domain. Particularly, we propose to formulate the video summarization task as a Markov Decision Process, where each state is a combination of the features of the video shots added to the summary, and the possible actions are to include/remove a shot from the summary or leave it as is. Using a set of domain-specific human-generated video highlights as examples, we employ a Maximum Entropy IRL algorithm to learn the implicit reward function governing the summary generation process. The learned reward function is then used to train an RL-agent that can produce video summaries for a specific domain, closely resembling what a human expert would create. Learning from expert demonstrations allows our approach to be applicable to any domain or editorial styles. 
To demonstrate the superior performance of our approach, we employ it to the task of soccer games highlight generation and show that it outperforms other state-of-the-art methods, both quantitatively and qualitatively.","PeriodicalId":120972,"journal":{"name":"2020 IEEE International Symposium on Multimedia (ISM)","volume":"330 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"SumBot: Summarize Videos Like a Human\",\"authors\":\"Hongxiang Gu, Stefano Petrangeli, Viswanathan Swaminathan\",\"doi\":\"10.1109/ISM.2020.00044\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video currently accounts for 70% of all internet traffic and this number is expected to continue to grow. Each minute, more than 500 hours worth of videos are uploaded on YouTube. Generating engaging short videos out of the raw captured content is often a time-consuming and cumbersome activity for content creators. Existing ML- based video summarization and highlight generation approaches often neglect the fact that many summarization tasks require specific domain knowledge of the video content, and that human editors often follow a semistructured template when creating the summary (e.g. to create the highlights for a sport event). We therefore address in this paper the challenge of creating domain-specific summaries, by actively leveraging this editorial template. Particularly, we present an Inverse Reinforcement Learning (IRL)-based framework that can automatically learn the hidden structure or template followed by a human expert when generating a video summary for a specific domain. 
Particularly, we propose to formulate the video summarization task as a Markov Decision Process, where each state is a combination of the features of the video shots added to the summary, and the possible actions are to include/remove a shot from the summary or leave it as is. Using a set of domain-specific human-generated video highlights as examples, we employ a Maximum Entropy IRL algorithm to learn the implicit reward function governing the summary generation process. The learned reward function is then used to train an RL-agent that can produce video summaries for a specific domain, closely resembling what a human expert would create. Learning from expert demonstrations allows our approach to be applicable to any domain or editorial styles. To demonstrate the superior performance of our approach, we employ it to the task of soccer games highlight generation and show that it outperforms other state-of-the-art methods, both quantitatively and qualitatively.\",\"PeriodicalId\":120972,\"journal\":{\"name\":\"2020 IEEE International Symposium on Multimedia (ISM)\",\"volume\":\"330 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE International Symposium on Multimedia (ISM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISM.2020.00044\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Symposium on Multimedia 
(ISM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISM.2020.00044","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
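The MDP formulation in the abstract can be made concrete with a small sketch: the state combines the features of the shots currently in the summary, and each action includes a shot, removes one, or leaves the summary as is. This is an illustrative assumption only; the class name, action encoding, and mean-pooling of shot features below are not specified by the paper.

```python
# Hypothetical sketch of the summarization MDP described in the abstract.
# Feature combination via mean-pooling is an assumption for illustration.

KEEP, ADD, REMOVE = 0, 1, 2  # leave summary as is / include shot / drop shot

class SummaryMDP:
    """State = combination of the features of shots currently in the summary."""

    def __init__(self, shot_features):
        self.shots = [list(f) for f in shot_features]   # per-shot feature vectors
        self.in_summary = [False] * len(self.shots)     # current summary membership

    def state(self):
        # Combine features of the selected shots (mean-pooling as one choice).
        selected = [f for f, kept in zip(self.shots, self.in_summary) if kept]
        if not selected:
            return [0.0] * len(self.shots[0])
        n = len(selected)
        return [sum(col) / n for col in zip(*selected)]

    def step(self, shot_idx, action):
        # Apply one editorial action and return the resulting state.
        if action == ADD:
            self.in_summary[shot_idx] = True
        elif action == REMOVE:
            self.in_summary[shot_idx] = False
        # KEEP leaves the summary unchanged.
        return self.state()
```

A Maximum Entropy IRL algorithm would then fit a reward function over these states so that expert-demonstrated summaries become most likely, and an RL agent trained against that learned reward would generate new summaries.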