A Hybrid Video-to-Text Summarization Framework and Algorithm on Cascading Advanced Extractive- and Abstractive-based Approaches for Supporting Viewers' Video Navigation and Understanding

Aishwarya Ramakrishnan, Chun-Kit Ngan
{"title":"A Hybrid Video-to-Text Summarization Framework and Algorithm on Cascading Advanced Extractive- and Abstractive-based Approaches for Supporting Viewers' Video Navigation and Understanding","authors":"Aishwarya Ramakrishnan, Chun-Kit Ngan","doi":"10.1109/AIKE55402.2022.00012","DOIUrl":null,"url":null,"abstract":"In this work, we propose the development of a hybrid video-to-text summarization (VTS) framework on cascading the advanced and code-accessible extractive and abstractive (EA) approaches for supporting viewers' video navigation and understanding. More precisely, the contributions of this paper are three-fold. First, we devise an automated and unified hybrid VTS framework that takes an arbitrary video as an input, generates the text transcripts from its human dialogues, and then summarizes the text transcripts into one short video synopsis. Second, we advance the binary merge-sort approach and expand its use to develop an intuitive and heuristic abstractive-based algorithm, with the time complexity $O(T_{L}logT_{L})$ and the space complexity $O(T_{L})$, where TL is the total number of word tokens on a text, to dynamically and successively split and merge a long piece of text transcripts, which exceeds the input text size limitation of an abstractive model, to generate one final semantic video synopsis. At the end, we test the feasibility of applying this proposed framework and algorithm in conducting the preliminarily experimental evaluations on three different videos, as a pilot study, in genres, contents, and lengths. We show that our approach outperforms and/or levels most of the individual EA methods stated above by 75% in terms of the ROUGE F1-Score measurement.","PeriodicalId":441077,"journal":{"name":"2022 IEEE Fifth International Conference on Artificial Intelligence and Knowledge Engineering (AIKE)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE Fifth International Conference on Artificial Intelligence and Knowledge Engineering (AIKE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AIKE55402.2022.00012","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

In this work, we propose a hybrid video-to-text summarization (VTS) framework that cascades advanced, code-accessible extractive and abstractive (EA) approaches to support viewers' video navigation and understanding. More precisely, the contributions of this paper are three-fold. First, we devise an automated and unified hybrid VTS framework that takes an arbitrary video as input, generates text transcripts from its human dialogue, and then summarizes those transcripts into one short video synopsis. Second, we adapt and extend the binary merge-sort approach to develop an intuitive, heuristic abstractive-based algorithm, with time complexity $O(T_L \log T_L)$ and space complexity $O(T_L)$, where $T_L$ is the total number of word tokens in a text, that dynamically and successively splits and merges a long transcript exceeding the input-size limit of an abstractive model to generate one final semantic video synopsis. Third, as a pilot study, we test the feasibility of the proposed framework and algorithm through preliminary experimental evaluations on three videos that differ in genre, content, and length. We show that our approach outperforms or matches most of the individual EA methods in 75% of the ROUGE F1-score measurements.
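To make the split-and-merge step concrete, the following is a minimal Python sketch of the merge-sort-style summarization described above. It is not the authors' implementation: summarize_fn, max_tokens, and the stub summarizer are hypothetical placeholders standing in for an abstractive model (e.g., a transformer summarizer) and its input-size limit.

from typing import Callable, List

def recursive_summarize(tokens: List[str],
                        summarize_fn: Callable[[str], str],
                        max_tokens: int) -> str:
    """Split a transcript that exceeds the model's input limit,
    summarize the halves, and merge the partial summaries."""
    if len(tokens) <= max_tokens:
        # Base case: the chunk fits the model's input window.
        return summarize_fn(" ".join(tokens))
    # Split step: halve the token sequence, as in binary merge sort.
    mid = len(tokens) // 2
    left = recursive_summarize(tokens[:mid], summarize_fn, max_tokens)
    right = recursive_summarize(tokens[mid:], summarize_fn, max_tokens)
    # Merge step: summarize the concatenation of the two partial
    # summaries so the text keeps shrinking toward one synopsis.
    # (Assumes summarize_fn returns output shorter than its input.)
    return recursive_summarize((left + " " + right).split(),
                               summarize_fn, max_tokens)

# Toy usage with a stub "summarizer" that keeps the first ten words.
stub = lambda text: " ".join(text.split()[:10])
transcript = ("word " * 500).split()
print(recursive_summarize(transcript, stub, max_tokens=100))

Under the assumption that each summarization call shrinks its input, every recursion level touches O(T_L) tokens and there are O(log T_L) levels, which is consistent with the O(T_L log T_L) time and O(T_L) space bounds stated in the abstract.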