Morphology aware data augmentation with neural language models for online hybrid ASR


Acta Linguistica Academica Pub Date : 2022-11-21 DOI:10.1556/2062.2022.00582

Balázs Tarján, T. Fegyó, P. Mihajlik


Abstract

Recognition of Hungarian conversational telephone speech is challenging due to the informal style and morphological richness of the language. Neural Network Language Models (NNLMs) can remedy the high perplexity of the task; however, their high complexity makes them very difficult to apply in the first (single) pass of an online system. Recent studies showed that a considerable part of the knowledge of NNLMs can be transferred to traditional n-grams through data augmentation based on neural text generation. Data augmentation with NNLMs works well for isolating languages; however, we show that it causes a vocabulary explosion in a morphologically rich language. Therefore, we propose a new, morphology-aware neural text augmentation method, in which we retokenize the generated text into statistically derived subwords. We compare the performance of word-based and subword-based data augmentation techniques with recurrent and Transformer language models and show that subword-based methods can significantly lower the Word Error Rate (WER) while greatly reducing vocabulary size and memory requirements. Combining subword-based modeling and neural language model-based data augmentation, we were able to achieve an 11% relative WER reduction and preserve real-time operation of our conversational telephone speech recognition system. Finally, we also demonstrate that subword-based neural text augmentation outperforms the word-based approach not only in terms of overall WER but also in the recognition of Out-of-Vocabulary (OOV) words.
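The retokenization step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the subword inventory `SUBWORDS`, the greedy longest-match rule in `segment`, the `+` continuation marker, and the toy Hungarian word forms in `generated` are all assumptions made for illustration. In the paper the subwords are derived statistically and the text comes from a neural language model.

```python
from collections import Counter

# Hypothetical subword inventory. In a real system it would be learned
# statistically from the training corpus rather than written by hand.
SUBWORDS = {"kert", "ember", "ek", "ben", "hez"}


def segment(word, inventory):
    """Greedy longest-match split; non-initial pieces are marked with '+'."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first; fall back to a single character
        # so the loop always advances.
        for end in range(len(word), start, -1):
            if word[start:end] in inventory or end == start + 1:
                break
        piece = word[start:end]
        pieces.append(piece if start == 0 else "+" + piece)
        start = end
    return pieces


def retokenize(sentence, inventory):
    """Replace every word of a sentence with its subword pieces."""
    return [p for word in sentence.split() for p in segment(word, inventory)]


def ngram_counts(token_lists, n):
    """Count n-grams with sentence-boundary padding."""
    counts = Counter()
    for tokens in token_lists:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        counts.update(tuple(padded[i:i + n]) for i in range(len(padded) - n + 1))
    return counts


# Stand-in for sentences sampled from a neural language model.
generated = [
    "kertekben emberek",
    "kertben ember",
    "emberekben kertek",
    "kerthez emberhez",
]

word_tokens = [s.split() for s in generated]
subword_tokens = [retokenize(s, SUBWORDS) for s in generated]

word_vocab = {t for sent in word_tokens for t in sent}
subword_vocab = {t for sent in subword_tokens for t in sent}

# Counts like these would feed an n-gram language model estimator.
bigrams = ngram_counts(subword_tokens, 2)

print(len(word_vocab), len(subword_vocab))
print(segment("kertekben", SUBWORDS))
```

On this toy input every inflected form is a separate word type, while the subword view reuses a handful of stems and suffixes, which is the vocabulary effect the abstract refers to. The n-gram counts are then collected over subword tokens instead of full words.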


Source journal

Acta Linguistica Academica (Arts and Humanities: Literature and Literary Theory)

CiteScore: 1.00

Self-citation rate: 20.00%

About the journal: Acta Linguistica Academica publishes papers on general linguistics. Papers presenting empirical material must have strong theoretical implications. The scope of the journal is not restricted to the core areas of linguistics; it also covers areas such as socio- and psycholinguistics, neurolinguistics, discourse analysis, the philosophy of language, language typology, and formal semantics. The journal also publishes book and dissertation reviews and advertisements.