End-to-end recognition of streaming Japanese speech using CTC and local attention

Impact factor: 3.2 · JCR Q1 · Computer Science

APSIPA Transactions on Signal and Information Processing · Pub Date: 2020-11-23 · DOI: 10.1017/ATSIP.2020.23

Jiahao Chen, Ryota Nishimura, N. Kitaoka


Citations: 2

Abstract

Many end-to-end, large-vocabulary continuous speech recognition systems now achieve better recognition performance than conventional systems. However, most of these approaches are based on bidirectional networks and sequence-to-sequence modeling, so automatic speech recognition (ASR) systems using such techniques must wait for an entire segment of voice input before they can begin processing the data. The resulting time lag can be a serious drawback in some applications. An obvious solution to this problem is to develop a speech recognition algorithm capable of processing streaming data. Therefore, in this paper we explore the possibility of a streaming, online ASR system for Japanese using a model based on unidirectional LSTMs trained with the connectionist temporal classification (CTC) criterion, with local attention. Such an approach has not been well investigated for Japanese, as most Japanese-language ASR systems employ bidirectional networks. The best result for our proposed system during experimental evaluation was a character error rate of 9.87%.
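To make two terms from the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a model that produces one score vector per audio frame, with the CTC blank at index 0, and it omits the local attention component entirely. The names `streaming_ctc_greedy` and `character_error_rate` are hypothetical. The first function shows why a CTC output layer suits streaming: each label can be emitted as soon as its frame arrives. The second shows how a character error rate such as the reported 9.87% is conventionally computed, as edit distance over characters divided by reference length.

```python
from typing import Iterable, Iterator, Sequence

BLANK = 0  # assumption: index of the CTC blank symbol in each score vector


def streaming_ctc_greedy(frame_scores: Iterable[Sequence[float]],
                         blank: int = BLANK) -> Iterator[int]:
    """Yield label ids frame by frame using greedy (best-path) CTC decoding."""
    prev = blank
    for scores in frame_scores:
        # Pick the highest-scoring label for the current frame only.
        best = max(range(len(scores)), key=scores.__getitem__)
        # CTC collapse rule: drop blanks and consecutive repeats.
        if best != blank and best != prev:
            yield best
        prev = best


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    prev_row = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        row = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            row.append(min(
                prev_row[j] + 1,                           # deletion
                row[j - 1] + 1,                            # insertion
                prev_row[j - 1] + (ref_char != hyp_char),  # substitution
            ))
        prev_row = row
    return prev_row[-1] / len(reference)
```

For example, `character_error_rate("こんにちは", "こんにちわ")` returns 0.2, since one of the five reference characters is substituted. Because `streaming_ctc_greedy` is a generator, it can consume frames from a live source and never needs to see the end of the utterance, which is the property a bidirectional network lacks.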


