Multimodal Fusion of Speech and Text using Semi-supervised LDA for Indexing Lecture Videos

M. Husain, S. Meena
{"title":"Multimodal Fusion of Speech and Text using Semi-supervised LDA for Indexing Lecture Videos","authors":"M. Husain, S. Meena","doi":"10.1109/NCC.2019.8732253","DOIUrl":null,"url":null,"abstract":"Lecture videos are the most popular learning materials due to their pedagogical benefits. However, accessing a topic or subtopic of interest requires manual examination of each frame of the video and it is more tedious when the volume and length of videos increases. The main problem thus becomes the efficient automatic segmentation and indexing of lecture videos that enables faster retrieval of specific and relevant content. In this paper, we present automatic indexing of lecture videos using topic hierarchies extracted from slide text and audio transcripts. Indexing videos based on slide text information is more accurate due to higher character recognition rates but, text content is very abstract and subjective. In contrast to slide text, audio transcripts provide comprehensive details about the topics, however retrieval results are imprecise due to higher WER. In order to address this problem, we propose a novel idea of fusing complementary strengths of slide text and audio transcript information using semi-supervised LDA algorithm. Further, we strive to improve learning of the model by utilizing words recognized from video slides as seed words and train the model to learn the distribution of video transcriptions around these seed words. We test the performance of proposed multimodal indexing scheme on 500 number of class room videos downloaded from Coursera, NPTEL and KLETU (KLE Technological University) classroom videos. 
The proposed multimodal fusion based scheme achieves an average percentage improvement of 44.49% F-Score compared with indexing using unimodal approaches.","PeriodicalId":6870,"journal":{"name":"2019 National Conference on Communications (NCC)","volume":"14 1","pages":"1-6"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 National Conference on Communications (NCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NCC.2019.8732253","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 9

Abstract

Lecture videos are among the most popular learning materials owing to their pedagogical benefits. However, locating a topic or subtopic of interest requires manual examination of each frame of a video, which becomes increasingly tedious as the volume and length of videos grow. The main problem is thus efficient automatic segmentation and indexing of lecture videos to enable faster retrieval of specific, relevant content. In this paper, we present automatic indexing of lecture videos using topic hierarchies extracted from slide text and audio transcripts. Indexing based on slide text is more accurate because character recognition rates are high, but slide content is abstract and subjective. In contrast, audio transcripts provide comprehensive detail about the topics, yet retrieval results are imprecise because of higher word error rates (WER). To address this problem, we propose a novel idea: fusing the complementary strengths of slide text and audio-transcript information using a semi-supervised LDA algorithm. Further, we improve the learning of the model by using words recognized from video slides as seed words and training the model to learn the distribution of transcript words around these seeds. We evaluate the proposed multimodal indexing scheme on 500 classroom videos from Coursera, NPTEL, and KLETU (KLE Technological University). The proposed multimodal fusion scheme achieves an average improvement of 44.49% in F-score over indexing with unimodal approaches.
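The core idea above, biasing LDA with slide-text seed words so that transcript words cluster around them, can be sketched as a collapsed Gibbs sampler with an asymmetric topic-word prior. This is a minimal illustration, not the authors' implementation: the toy corpus, seed sets, and the `seed_boost` parameter are all assumptions introduced here.

```python
import numpy as np

def seeded_lda(docs, vocab, seeds, n_topics, alpha=0.1, beta=0.01,
               seed_boost=5.0, iters=200, rng=None):
    """Collapsed Gibbs sampling for LDA with seed-word priors.

    seeds: {topic_id: [words]} -- e.g. words OCR'd from slides; each seed
    word gets extra prior mass (seed_boost) in its topic, nudging transcript
    words that co-occur with it into the same topic.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    # Asymmetric Dirichlet prior over topic-word distributions.
    eta = np.full((n_topics, V), beta)
    for t, words in seeds.items():
        for w in words:
            eta[t, w2i[w]] += seed_boost
    docs_i = [[w2i[w] for w in d] for d in docs]
    ndk = np.zeros((len(docs), n_topics))   # doc-topic counts
    nkw = np.zeros((n_topics, V))           # topic-word counts
    nk = np.zeros(n_topics)                 # tokens per topic
    z = []
    for d, doc in enumerate(docs_i):        # random initialization
        zd = [int(rng.integers(n_topics)) for _ in doc]
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):                  # Gibbs sweeps
        for d, doc in enumerate(docs_i):
            for n, w in enumerate(doc):
                t = z[d][n]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # Conditional p(topic | everything else), seed prior included.
                p = (ndk[d] + alpha) * (nkw[:, w] + eta[:, w]) / (nk + eta.sum(1))
                t = int(rng.choice(n_topics, p=p / p.sum()))
                z[d][n] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    # Posterior topic-word distributions.
    phi = (nkw + eta) / (nkw + eta).sum(1, keepdims=True)
    return phi, ndk

# Toy example: hypothetical slide seed words anchor two transcript topics.
vocab = ["sorting", "merge", "quick", "algorithm",
         "network", "packet", "router", "protocol"]
docs = [["sorting", "merge", "quick", "sorting", "algorithm"],
        ["network", "packet", "router", "network", "protocol"],
        ["quick", "sorting", "algorithm", "merge"],
        ["protocol", "packet", "network", "router"]]
seeds = {0: ["sorting"], 1: ["network"]}
phi, ndk = seeded_lda(docs, vocab, seeds, n_topics=2)
```

After sampling, each seeded topic concentrates probability on its seed word and the transcript words that co-occur with it, which is the fusion effect the paper exploits: accurate but sparse slide text anchors the topics, while noisy but rich transcript text fills them out.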