Semantically similar document retrieval framework for language model speaker adaptation
J. Staš, D. Zlacký, D. Hládek
2016 26th International Conference Radioelektronika (RADIOELEKTRONIKA), 2016-04-19
DOI: https://doi.org/10.1109/RADIOELEK.2016.7477408
Citations: 3
Abstract
This paper presents a framework for retrieving semantically similar documents in order to adapt a Slovak language model to the speaking style of a specific speaker. The research extends our previous study on language model speaker adaptation for the transcription of Slovak parliament proceedings using available speaker-specific text data. We used a large corpus to retrieve, for each speaker, a subset of semantically similar text documents, and used that subset to adjust the parameters of an existing well-trained language model to the speaker's speaking style. The same corpus was used to build the original topic-specific model of the Slovak language deployed in our automatic subtitling system. In the proposed framework, latent semantic indexing was implemented to retrieve the subset of semantically similar documents. The output hypotheses from the first step of speech recognition were used to identify patterns between terms and concepts contained in an unstructured collection of text documents. Preliminary results show a slight improvement in speech recognition accuracy for individual speakers in fully automatic subtitling of parliament speech, broadcast news TV shows, and TEDx talks.
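
The retrieval step described above can be illustrated with a minimal latent semantic indexing sketch. This is not the authors' implementation: the toy English documents, the raw term counts (no TF-IDF or other weighting), the rank `k=2`, and the helper names `build_term_document_matrix` and `lsi_rank` are all assumptions made for illustration. The query string stands in for a first-pass recognition hypothesis, which is folded into the latent space and compared with every document by cosine similarity.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Return a (terms x documents) raw-count matrix and the term index."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    return A, index


def lsi_rank(docs, query, k=2):
    """Rank docs by cosine similarity to the query in a rank-k latent space."""
    A, index = build_term_document_matrix(docs)

    # Truncated SVD: A is approximated by U_k * diag(s_k) * Vt_k.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    # Bag-of-words query vector; words outside the vocabulary are ignored.
    q = np.zeros(A.shape[0])
    for w in query.lower().split():
        if w in index:
            q[index[w]] += 1.0

    # Fold the query into the latent space: q_hat = diag(s_k)^-1 * U_k^T * q.
    q_hat = (q @ U_k) / s_k

    # Each row of doc_vecs is one document in the latent space.
    doc_vecs = Vt_k.T

    eps = 1e-12  # guards against division by zero for empty vectors
    scores = doc_vecs @ q_hat / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat) + eps
    )
    return sorted(
        ((j, float(sc)) for j, sc in enumerate(scores)),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical toy collection: two parliament-like and two sports-like texts.
docs = [
    "parliament debate budget law vote",
    "parliament vote law amendment",
    "football match goal team",
    "team goal league match football",
]
# Stand-in for a first-pass speech recognition hypothesis.
hypothesis = "budget law vote parliament"

for doc_id, score in lsi_rank(docs, hypothesis, k=2):
    print(doc_id, round(score, 3))
```

In this toy setting the two parliament-like documents rank above the sports-like ones. In a pipeline like the one the abstract describes, the top-ranked documents would form the speaker-specific subset used for language model adaptation; how that adaptation is carried out is not specified here.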