Recognition of target domain Japanese speech using language model replacement
Authors: Daiki Mori, Kengo Ohta, Ryota Nishimura, Atsunori Ogawa, Norihide Kitaoka
Journal: EURASIP Journal on Audio, Speech, and Music Processing
Published: 2024-07-20
DOI: https://doi.org/10.1186/s13636-024-00360-8
Abstract
End-to-end (E2E) automatic speech recognition (ASR) models are deep learning models that perform ASR with a single neural network. These models must be trained on a large amount of data; however, collecting speech data that matches the target domain can be difficult, so speech data that does not exactly match the target domain is often used instead, which lowers performance. In-domain text data is much easier to obtain than in-domain speech data. Traditional ASR systems can exploit such text because they use separately trained language models and HMM-based acoustic models. An E2E ASR model, however, learns acoustic and language information in an integrated manner, so its language information is difficult to separate out; this makes it very difficult to build E2E ASR models for a specialized target domain that achieve sufficient recognition performance at a reasonable cost. In this paper, we propose a method of replacing the language information within a pre-trained E2E ASR model in order to adapt it to a target domain. We remove the “implicit” language information contained in the ASR model by subtracting, in the logarithmic domain, a source-domain language model trained on the transcriptions of the ASR model's training data. We then integrate a target-domain language model through addition in the logarithmic domain. This replacement of the language model by subtraction and addition is based on Bayes' theorem. In our experiments, we first used two datasets from the Corpus of Spontaneous Japanese (CSJ) to evaluate the effectiveness of our method. We then evaluated our method using the Japanese Newspaper Article Speech (JNAS) and CSJ corpora, which contain audio data from the read speech and spontaneous speech domains, respectively, to test how well the proposed method bridges the gap between these two language domains. Our results show that the proposed language model replacement method achieved better ASR performance than both non-adapted (baseline) ASR models and ASR models adapted using the conventional Shallow Fusion method.
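The log-domain subtraction and addition described in the abstract can be illustrated with a small rescoring sketch. It assumes that each hypothesis W already has three log-probabilities available: log P_E2E(W|X) from the E2E model, log P_src(W) from the source-domain language model, and log P_tgt(W) from the target-domain language model. The function names, the two weights, and the toy numbers are hypothetical and chosen only for illustration; the abstract does not state whether the method uses interpolation weights, and none of the values come from the paper.

```python
import math


def replaced_score(asr_logp, src_lm_logp, tgt_lm_logp,
                   src_weight=1.0, tgt_weight=1.0):
    """Language model replacement in the log domain.

    Subtract the source-domain LM score (removing the implicit LM)
    and add the target-domain LM score. The weights are hypothetical
    tuning knobs; with both set to 1.0 this is the plain Bayes form
    log P_E2E(W|X) - log P_src(W) + log P_tgt(W).
    """
    return asr_logp - src_weight * src_lm_logp + tgt_weight * tgt_lm_logp


def shallow_fusion_score(asr_logp, tgt_lm_logp, tgt_weight=1.0):
    """Shallow Fusion: only add the target-domain LM score."""
    return asr_logp + tgt_weight * tgt_lm_logp


def best_hypothesis(hypotheses, score_fn):
    """Return the text of the hypothesis with the highest score."""
    return max(hypotheses, key=score_fn)["text"]


# Toy n-best list with made-up probabilities (not from the paper).
# Hypothesis "a" is favored by the source-domain LM,
# hypothesis "b" by the target-domain LM.
hypotheses = [
    {"text": "a", "asr": math.log(0.6), "src": math.log(0.5), "tgt": math.log(0.1)},
    {"text": "b", "asr": math.log(0.4), "src": math.log(0.1), "tgt": math.log(0.5)},
]

# No adaptation: rank by the E2E model score alone.
baseline_best = best_hypothesis(hypotheses, lambda h: h["asr"])

# Shallow Fusion: E2E score plus target-domain LM score.
fusion_best = best_hypothesis(
    hypotheses, lambda h: shallow_fusion_score(h["asr"], h["tgt"])
)

# Replacement: E2E score minus source LM score plus target LM score.
replaced_best = best_hypothesis(
    hypotheses, lambda h: replaced_score(h["asr"], h["src"], h["tgt"])
)

print(baseline_best, fusion_best, replaced_best)
```

In this toy case the non-adapted ranking prefers the hypothesis that the source-domain language model favors, while the replacement score prefers the one that the target-domain language model favors. A real system would apply such a score during beam search or n-best rescoring, with language model scores computed by trained models rather than fixed numbers.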
About the journal:
The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.