基于变压器网络的声-铰接反演流模型

Sathvik Udupa, Aravind Illa, P. Ghosh
{"title":"基于变压器网络的声-铰接反演流模型","authors":"Sathvik Udupa, Aravind Illa, P. Ghosh","doi":"10.21437/interspeech.2022-10159","DOIUrl":null,"url":null,"abstract":"Estimating speech articulatory movements from speech acoustics is known as Acoustic to Articulatory Inversion (AAI). Recently, transformer-based AAI models have been shown to achieve state-of-art performance. However, in transformer networks, the attention is applied over the whole utterance, thereby needing to obtain the full utterance before the inference, which leads to high latency and is impractical for streaming AAI. To enable streaming during inference, evaluation could be performed on non-overlapping chucks instead of a full utterance. However, due to a mismatch of the attention receptive field during training and evaluation, there could be a drop in AAI performance. To overcome this scenario, in this work we perform experiments with different attention masks and use context from previous predictions during training. Experiments results revealed that using the random start mask attention with the context from previous predictions of transformer decoder performs better than the baseline results.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"625-629"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Streaming model for Acoustic to Articulatory Inversion with transformer networks\",\"authors\":\"Sathvik Udupa, Aravind Illa, P. Ghosh\",\"doi\":\"10.21437/interspeech.2022-10159\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Estimating speech articulatory movements from speech acoustics is known as Acoustic to Articulatory Inversion (AAI). Recently, transformer-based AAI models have been shown to achieve state-of-art performance. However, in transformer networks, the attention is applied over the whole utterance, thereby needing to obtain the full utterance before the inference, which leads to high latency and is impractical for streaming AAI. To enable streaming during inference, evaluation could be performed on non-overlapping chucks instead of a full utterance. However, due to a mismatch of the attention receptive field during training and evaluation, there could be a drop in AAI performance. To overcome this scenario, in this work we perform experiments with different attention masks and use context from previous predictions during training. Experiments results revealed that using the random start mask attention with the context from previous predictions of transformer decoder performs better than the baseline results.\",\"PeriodicalId\":73500,\"journal\":{\"name\":\"Interspeech\",\"volume\":\"1 1\",\"pages\":\"625-629\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Interspeech\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/interspeech.2022-10159\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interspeech","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/interspeech.2022-10159","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

摘要

从语音声学中估计语音发音运动被称为声学到发音反转(AAI)。最近,基于变压器的AAI模型已被证明能够实现最先进的性能。然而,在变压器网络中,注意力集中在整个话语上,因此需要在推理之前获得完整的话语,这导致了高延迟,并且不适合流式AAI。为了在推理过程中实现流,可以在不重叠的卡盘上执行评估,而不是在完整的话语上执行评估。然而,由于在训练和评估过程中注意接受野的不匹配,AAI的表现可能会下降。为了克服这种情况,在这项工作中,我们使用不同的注意力面具进行实验,并在训练期间使用先前预测的上下文。实验结果表明,将随机开始掩码注意与变压器解码器先前预测的上下文结合使用,效果优于基线结果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Streaming model for Acoustic to Articulatory Inversion with transformer networks
Estimating speech articulatory movements from speech acoustics is known as Acoustic to Articulatory Inversion (AAI). Recently, transformer-based AAI models have been shown to achieve state-of-art performance. However, in transformer networks, the attention is applied over the whole utterance, thereby needing to obtain the full utterance before the inference, which leads to high latency and is impractical for streaming AAI. To enable streaming during inference, evaluation could be performed on non-overlapping chucks instead of a full utterance. However, due to a mismatch of the attention receptive field during training and evaluation, there could be a drop in AAI performance. To overcome this scenario, in this work we perform experiments with different attention masks and use context from previous predictions during training. Experiments results revealed that using the random start mask attention with the context from previous predictions of transformer decoder performs better than the baseline results.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信