{"title":"基于图像和标题的LSTM语言模型自适应多媒体自动语音识别","authors":"Yasufumi Moriya, G. Jones","doi":"10.1109/SLT.2018.8639551","DOIUrl":null,"url":null,"abstract":"Transcription of multimedia data sources is often a challenging automatic speech recognition (ASR) task. The incorporation of visual features as additional contextual information as a means to improve ASR for this data has recently drawn attention from researchers. Our investigation extends existing ASR methods by using images and video titles to adapt a recurrent neural network (RNN) language model with a long-short term memory (LSTM) network. Our language model is tested on transcription of an existing corpus of instruction videos and on a new corpus consisting of lecture videos. Consistent reduction in perplexity by 5–10 is observed on both datasets. When the non-adapted model is combined with the image adaptation and video title adaptation models for n-best ASR hypotheses re-ranking, additionally the word error rate (WER) is decreased by around 0.5% on both datasets. By analysing the output word probabilities of the model, it is found that both image adaptation and video title adaptation give the model more confidence in the choice of contextually correct informative words","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":"{\"title\":\"LSTM Language Model Adaptation with Images and Titles for Multimedia Automatic Speech Recognition\",\"authors\":\"Yasufumi Moriya, G. Jones\",\"doi\":\"10.1109/SLT.2018.8639551\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transcription of multimedia data sources is often a challenging automatic speech recognition (ASR) task. The incorporation of visual features as additional contextual information as a means to improve ASR for this data has recently drawn attention from researchers. Our investigation extends existing ASR methods by using images and video titles to adapt a recurrent neural network (RNN) language model with a long-short term memory (LSTM) network. Our language model is tested on transcription of an existing corpus of instruction videos and on a new corpus consisting of lecture videos. Consistent reduction in perplexity by 5–10 is observed on both datasets. When the non-adapted model is combined with the image adaptation and video title adaptation models for n-best ASR hypotheses re-ranking, additionally the word error rate (WER) is decreased by around 0.5% on both datasets. 
By analysing the output word probabilities of the model, it is found that both image adaptation and video title adaptation give the model more confidence in the choice of contextually correct informative words\",\"PeriodicalId\":377307,\"journal\":{\"name\":\"2018 IEEE Spoken Language Technology Workshop (SLT)\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"14\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE Spoken Language Technology Workshop (SLT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SLT.2018.8639551\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT.2018.8639551","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
LSTM Language Model Adaptation with Images and Titles for Multimedia Automatic Speech Recognition
Transcription of multimedia data sources is often a challenging automatic speech recognition (ASR) task. The incorporation of visual features as additional contextual information has recently drawn attention from researchers as a means to improve ASR for this data. Our investigation extends existing ASR methods by using images and video titles to adapt a recurrent neural network (RNN) language model built on a long short-term memory (LSTM) network. Our language model is tested on transcription of an existing corpus of instruction videos and on a new corpus consisting of lecture videos. A consistent reduction in perplexity of 5–10 is observed on both datasets. When the non-adapted model is combined with the image adaptation and video title adaptation models for n-best ASR hypothesis re-ranking, the word error rate (WER) is additionally decreased by around 0.5% on both datasets. By analysing the output word probabilities of the model, it is found that both image adaptation and video title adaptation give the model more confidence in the choice of contextually correct informative words.
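To illustrate the kind of adaptation the abstract describes, below is a minimal sketch (not the authors' exact architecture) of an LSTM language model conditioned on a visual or title context vector, together with a simple n-best re-ranking routine. The fusion scheme (concatenating a projected context vector to the word embedding at every time step), the dimensions, and the scoring weights are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of context-adapted LSTM LM + n-best re-ranking (assumed design).
import torch
import torch.nn as nn


class ContextLSTMLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, context_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Project the context (e.g. a CNN image feature or an averaged
        # title-word embedding) into the word-embedding space.
        self.context_proj = nn.Linear(context_dim, embed_dim)
        # The LSTM input is the word embedding concatenated with the
        # projected context at every time step.
        self.lstm = nn.LSTM(embed_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, context):
        # tokens: (batch, seq_len) word indices; context: (batch, context_dim)
        w = self.embed(tokens)                          # (B, T, E)
        c = self.context_proj(context).unsqueeze(1)     # (B, 1, E)
        c = c.expand(-1, w.size(1), -1)                 # (B, T, E)
        h, _ = self.lstm(torch.cat([w, c], dim=-1))     # (B, T, H)
        return self.out(h)                              # (B, T, V) logits


def rescore_nbest(model, hypotheses, context, asr_scores, lm_weight=0.5):
    """Re-rank n-best ASR hypotheses by combining each hypothesis's ASR score
    with the adapted LM log-probability (interpolation weight is illustrative)."""
    rescored = []
    for tokens, asr_score in zip(hypotheses, asr_scores):
        inp, tgt = tokens[:, :-1], tokens[:, 1:]        # predict each word from its history
        logp = torch.log_softmax(model(inp, context), dim=-1)
        lm_score = logp.gather(-1, tgt.unsqueeze(-1)).sum().item()
        rescored.append(asr_score + lm_weight * lm_score)
    best = max(range(len(rescored)), key=lambda i: rescored[i])
    return best, rescored
```

In this reading, swapping the image feature for a title embedding (or a zero vector for the non-adapted baseline) changes only the `context` argument, which matches the abstract's comparison of non-adapted, image-adapted, and title-adapted models under the same re-ranking setup.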