{"title":"Multimodal Fusion of Speech and Text using Semi-supervised LDA for Indexing Lecture Videos","authors":"M. Husain, S. Meena","doi":"10.1109/NCC.2019.8732253","DOIUrl":null,"url":null,"abstract":"Lecture videos are the most popular learning materials due to their pedagogical benefits. However, accessing a topic or subtopic of interest requires manual examination of each frame of the video and it is more tedious when the volume and length of videos increases. The main problem thus becomes the efficient automatic segmentation and indexing of lecture videos that enables faster retrieval of specific and relevant content. In this paper, we present automatic indexing of lecture videos using topic hierarchies extracted from slide text and audio transcripts. Indexing videos based on slide text information is more accurate due to higher character recognition rates but, text content is very abstract and subjective. In contrast to slide text, audio transcripts provide comprehensive details about the topics, however retrieval results are imprecise due to higher WER. In order to address this problem, we propose a novel idea of fusing complementary strengths of slide text and audio transcript information using semi-supervised LDA algorithm. Further, we strive to improve learning of the model by utilizing words recognized from video slides as seed words and train the model to learn the distribution of video transcriptions around these seed words. We test the performance of proposed multimodal indexing scheme on 500 number of class room videos downloaded from Coursera, NPTEL and KLETU (KLE Technological University) classroom videos. The proposed multimodal fusion based scheme achieves an average percentage improvement of 44.49% F-Score compared with indexing using unimodal approaches.","PeriodicalId":6870,"journal":{"name":"2019 National Conference on Communications (NCC)","volume":"14 1","pages":"1-6"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 National Conference on Communications (NCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NCC.2019.8732253","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 9
Abstract
Lecture videos are among the most popular learning materials due to their pedagogical benefits. However, locating a topic or subtopic of interest requires manual examination of the video, which becomes increasingly tedious as the volume and length of videos grow. The key problem is therefore efficient automatic segmentation and indexing of lecture videos, enabling faster retrieval of specific, relevant content. In this paper, we present automatic indexing of lecture videos using topic hierarchies extracted from slide text and audio transcripts. Indexing based on slide text is more accurate owing to high character recognition rates, but slide text is highly abstract and subjective. In contrast, audio transcripts provide comprehensive detail about the topics; however, retrieval results are imprecise due to a higher word error rate (WER). To address this problem, we propose a novel idea of fusing the complementary strengths of slide text and audio transcripts using a semi-supervised LDA algorithm. Further, we improve the model's learning by using words recognized from video slides as seed words and training the model to learn the distribution of transcript words around these seed words. We evaluate the proposed multimodal indexing scheme on 500 classroom videos drawn from Coursera, NPTEL, and KLETU (KLE Technological University) recordings. The proposed multimodal fusion scheme achieves an average improvement of 44.49% in F-score over indexing with unimodal approaches.
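The abstract only sketches the method, so the following is a minimal, hypothetical illustration of one common way to realize seed-word-guided (semi-supervised) LDA: biasing the per-topic Dirichlet word prior toward OCR'd slide words inside a collapsed Gibbs sampler, so transcript words that co-occur with slide terms gravitate to the same topic. The function name `seeded_lda`, the `seed_boost` multiplier, and all hyperparameters below are illustrative assumptions, not the authors' implementation.

```python
# Sketch of seeded LDA via collapsed Gibbs sampling (NumPy only).
# Assumption: seed words from slide OCR are given a boosted beta prior
# in their assigned topic; the paper may use a different seeding scheme.
import numpy as np

def seeded_lda(docs, vocab_size, seed_words, n_topics, alpha=0.1, beta=0.01,
               seed_boost=5.0, n_iters=200, rng=None):
    """docs: list of lists of token ids; seed_words: {topic_id: [token_id, ...]}."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Asymmetric topic-word prior: larger beta for seed words in their topic.
    beta_kw = np.full((n_topics, vocab_size), beta)
    for k, words in seed_words.items():
        beta_kw[k, words] += seed_boost * beta

    ndk = np.zeros((len(docs), n_topics))  # document-topic counts
    nkw = np.zeros((n_topics, vocab_size)) # topic-word counts
    nk = np.zeros(n_topics)                # tokens per topic
    z = []                                 # topic assignment per token
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Collapsed Gibbs conditional with the seeded (asymmetric) prior:
                # p(z=k) ∝ (n_dk + alpha) * (n_kw + beta_kw) / (n_k + sum_w beta_kw)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta_kw[:, w]) \
                    / (nk + beta_kw.sum(axis=1))
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    # Posterior topic-word distributions, usable for indexing video segments.
    return (nkw + beta_kw) / (nkw + beta_kw).sum(axis=1, keepdims=True)
```

Boosting the prior rather than hard-clamping seed words keeps the model semi-supervised: seed words strongly prefer their topic but can still be explained by others when the transcript evidence disagrees.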