Language independent end-to-end architecture for joint language identification and speech recognition

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Pub Date: 2017-12-01. DOI: 10.1109/ASRU.2017.8268945

Shinji Watanabe, Takaaki Hori, J. Hershey


Citations: 127

Abstract

End-to-end automatic speech recognition (ASR) can significantly reduce the burden of developing ASR systems for new languages, by eliminating the need for linguistic information such as pronunciation dictionaries. This also creates an opportunity, which we fully exploit in this paper, to build a monolithic multilingual ASR system with a language-independent neural network architecture. We present a model that can recognize speech in 10 different languages, by directly performing grapheme (character/chunked-character) based speech recognition. The model is based on our hybrid attention/connectionist temporal classification (CTC) architecture which has previously been shown to achieve the state-of-the-art performance in several ASR benchmarks. Here we augment its set of output symbols to include the union of character sets appearing in all the target languages. These include Roman and Cyrillic Alphabets, Arabic numbers, simplified Chinese, and Japanese Kanji/Hiragana/Katakana characters (5,500 characters in all). This allows training of a single multilingual model, whose parameters are shared across all the languages. The model can jointly identify the language and recognize the speech, automatically formatting the recognized text in the appropriate character set. The experiments, which used speech databases composed of Wall Street Journal (English), Corpus of Spontaneous Japanese, HKUST Mandarin CTS, and Voxforge (German, Spanish, French, Italian, Dutch, Portuguese, Russian), demonstrate comparable/superior performance relative to language-dependent end-to-end ASR systems.
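The step of augmenting the output symbols to the union of all target-language character sets can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_union_vocab` and `encode`, the toy transcripts, and the choice to reserve index 0 for a CTC blank symbol are all assumptions made for illustration.

```python
from typing import Dict, List


def build_union_vocab(transcripts_by_lang: Dict[str, List[str]]) -> Dict[str, int]:
    """Map every character seen in any language to one shared output index."""
    symbols = set()
    for transcripts in transcripts_by_lang.values():
        for text in transcripts:
            # Iterating a str yields Unicode code points, i.e. graphemes
            # in the simple character-level sense used here.
            symbols.update(text)

    # Assumption: index 0 is reserved for the CTC blank symbol.
    vocab = {"<blank>": 0}
    for ch in sorted(symbols):
        vocab[ch] = len(vocab)
    return vocab


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Convert a transcript into indices of the shared output layer."""
    return [vocab[ch] for ch in text]


# Hypothetical toy data; the paper's corpora are far larger.
toy_transcripts = {
    "en": ["hello"],
    "ru": ["привет"],
    "ja": ["こんにちは"],
}

vocab = build_union_vocab(toy_transcripts)
english_ids = encode("hello", vocab)
russian_ids = encode("привет", vocab)
```

Because every language indexes into the same table, a single output layer can emit Roman, Cyrillic, or Japanese characters, and the script of the emitted characters reflects the language the model has implicitly identified. Latin "e" and Cyrillic "е" are distinct code points, so they receive separate indices in this sketch.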



