Improving Speech-Based End-of-Turn Detection Via Cross-Modal Representation Learning with Punctuated Text Data

Ryo Masumura, Mana Ihori, Tomohiro Tanaka, Atsushi Ando, Ryo Ishii, T. Oba, Ryuichiro Higashinaka

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019

DOI: 10.1109/ASRU46091.2019.9003816

Citations: 7
Abstract
This paper presents a novel training method for speech-based end-of-turn detection that utilizes not only manually annotated speech data sets but also punctuated text data sets. Speech-based end-of-turn detection estimates, from speech information, whether a target speaker's utterance has ended. In previous studies, speech-based end-of-turn detection models were trained using only speech data sets containing manually annotated end-of-turn labels. However, since annotated speech data sets are often limited in size, these models were unable to correctly handle a wide variety of speech patterns. To mitigate this data scarcity problem, our key idea is to leverage punctuated text data sets to build more effective speech-based end-of-turn detection. To this end, the proposed method introduces cross-modal representation learning to construct a speech encoder and a text encoder that map speech and text carrying the same lexical information into similar vector representations. This enables us to train speech-based end-of-turn detection models from punctuated text data sets by tackling text-based sentence boundary detection. In experiments on contact center calls, we show that speech-based end-of-turn detection models using hierarchical recurrent neural networks can be improved through the use of punctuated text data sets.
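The core idea in the abstract — two modality-specific encoders trained so that speech and text with the same lexical content map to nearby vectors, feeding one shared end-of-turn classifier — can be illustrated with a minimal sketch. This is not the paper's implementation (the paper uses hierarchical recurrent neural networks); all encoder shapes, parameter names, and the squared-error alignment loss here are illustrative assumptions:

```python
import math
import random

random.seed(0)
EMB_DIM = 4  # shared cross-modal embedding size (illustrative)

def mean_pool(seq):
    # average a variable-length sequence of feature vectors into one vector
    n = len(seq)
    return [sum(x[i] for x in seq) / n for i in range(len(seq[0]))]

def project(vec, weights):
    # linear projection into the shared space, followed by tanh
    return [math.tanh(sum(v * w for v, w in zip(vec, row))) for row in weights]

# Hypothetical random parameters (stand-ins for learned weights).
W_SPEECH = [[random.gauss(0, 0.3) for _ in range(6)] for _ in range(EMB_DIM)]
W_TEXT = [[random.gauss(0, 0.3) for _ in range(3)] for _ in range(EMB_DIM)]

def speech_encoder(frames):
    # frames: list of 6-dim acoustic feature vectors
    return project(mean_pool(frames), W_SPEECH)

def text_encoder(embeds):
    # embeds: list of 3-dim token embedding vectors
    return project(mean_pool(embeds), W_TEXT)

def alignment_loss(h_speech, h_text):
    # cross-modal objective: pull paired speech/text representations together
    return sum((a - b) ** 2 for a, b in zip(h_speech, h_text)) / len(h_speech)

def end_of_turn_prob(h, w_cls):
    # one shared classifier consumes either modality's representation,
    # so it can be trained on text (sentence boundaries) and applied to speech
    return 1.0 / (1.0 + math.exp(-sum(x * w for x, w in zip(h, w_cls))))

# Toy paired inputs: an utterance's acoustic frames and its transcript tokens.
frames = [[random.gauss(0, 1) for _ in range(6)] for _ in range(10)]
tokens = [[random.gauss(0, 1) for _ in range(3)] for _ in range(5)]
w_cls = [random.gauss(0, 0.3) for _ in range(EMB_DIM)]

h_s = speech_encoder(frames)
h_t = text_encoder(tokens)
loss = alignment_loss(h_s, h_t)   # minimized during representation learning
p = end_of_turn_prob(h_t, w_cls)  # classifier trained on punctuated text
```

Because the alignment loss drives `h_s` and `h_t` toward each other, a classifier fit on abundant punctuated text transfers to speech inputs at test time — which is what lets the text data compensate for scarce annotated speech.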