End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training

Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando

Interspeech, pp. 3218-3222, published 2022-09-18. DOI: 10.21437/interspeech.2022-11357
Abstract
This paper proposes end-to-end joint modeling of conversation history-dependent and independent automatic speech recognition (ASR) systems. Conversation histories are available to ASR systems in applications such as meeting transcription, but not in applications such as voice search. So far, these two types of ASR system have been constructed individually with different models, which is inefficient for each application. In fact, conventional conversation history-dependent ASR systems can perform both history-dependent and history-independent processing. However, their performance is inferior to that of history-independent ASR systems, because their model architecture and training criterion are specialized for the case where conversational histories are available. To address this problem, our proposed end-to-end joint modeling method uses a crossmodal transformer-based architecture that can flexibly switch between using conversation histories and not using them. In addition, we propose multi-history training, which simultaneously utilizes a dataset without histories and datasets with various histories to effectively improve both types of ASR processing under the unified architecture; this training scheme can produce an ASR model that is robust to both a variety of conversational contexts and their absence. Experiments on Japanese ASR tasks demonstrate the effectiveness of the proposed method: the proposed E2E joint model provides superior performance in both history-dependent and history-independent ASR processing compared with conventional E2E-ASR systems.
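The core idea behind multi-history training, as described in the abstract, is to expose a single model to examples with varying amounts of conversation history, including none at all. A minimal sketch of how such training examples might be constructed is shown below; the function and field names are illustrative assumptions, not taken from the paper:

```python
import random

def make_training_example(conversation, index, max_history, rng):
    """Build one ASR training example from utterance `index` of a conversation.

    Multi-history idea: sample how many preceding utterances to attach,
    including zero, so a single model learns both history-dependent and
    history-independent recognition.
    """
    # Sample a history length from 0 (no context) up to max_history.
    k = rng.randint(0, max_history)
    # Take up to k utterances immediately preceding the target; may be empty.
    history = conversation[max(0, index - k):index]
    target = conversation[index]
    return {"history": history, "target": target}

rng = random.Random(0)
conv = ["hello", "how are you", "fine thanks", "see you"]
examples = [make_training_example(conv, i, max_history=3, rng=rng)
            for i in range(len(conv))]
```

At inference time, the same model would then accept either an empty history (voice-search-style use) or the accumulated prior utterances (meeting-transcription-style use), matching the switchable behavior the abstract describes.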