Unsupervised Acoustic-to-Articulatory Inversion with Variable Vocal Tract Anatomy

Interspeech, Pub Date: 2022-09-18, DOI: 10.21437/interspeech.2022-477

Yifan Sun, Qinlong Huang, Xihong Wu

Pages: 4656-4660

Citations: 2

Abstract



Acoustic and articulatory variability across speakers has always limited the generalization performance of acoustic-to-articulatory inversion (AAI) methods. Speaker-independent AAI (SI-AAI) methods generally focus on the transformation of acoustic features, but rarely consider the direct matching in the articulatory space. Unsupervised AAI methods have the potential of better generalization ability but typically use a fixed morphological setting of a physical articulatory synthesizer even for different speakers, which may cause nonnegligible articulatory compensation. In this paper, we propose to jointly estimate articulatory movements and vocal tract anatomy during the inversion of speech. An unsupervised AAI framework is employed, where estimated vocal tract anatomy is used to set the configuration of a physical articulatory synthesizer, which in turn is driven by estimated articulation movements to imitate a given speech. Experiments show that the estimation of vocal tract anatomy can bring both acoustic and articulatory benefits. Acoustically, the reconstruction quality is higher; articulatorily, the estimated articulatory movement trajectories better match the measured ones. Moreover, the estimated anatomy parameters show clear clusterings by speakers, indicating successful decoupling of speaker characteristics and linguistic content.
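The joint-estimation idea in the abstract can be illustrated with a toy analysis-by-synthesis loop. This is only a sketch under strong assumptions, not the authors' method: `synthesize` is a hypothetical two-formant stand-in for the physical articulatory synthesizer, built on the quarter-wavelength tube resonances F_n = (2n-1)c/(4L); the perturbation coefficients 0.3 and 0.2 are invented for illustration; and a plain grid search replaces whatever estimator the paper actually uses. A single tract length plays the role of the anatomy parameter shared by all frames, and one scalar per frame plays the role of the articulatory movement.

```python
import numpy as np

C = 35000.0  # approximate speed of sound in cm/s


def synthesize(length_cm, artic):
    """Toy stand-in for an articulatory synthesizer (hypothetical).

    Returns an array of shape (frames, 2) with two formant-like values in Hz.
    The 0.3 and 0.2 perturbation coefficients are made up for illustration.
    """
    artic = np.asarray(artic, dtype=float)
    base = C / (4.0 * length_cm)  # first resonance of a uniform tube
    f1 = base * (1.0 + 0.3 * artic)
    f2 = 3.0 * base * (1.0 - 0.2 * artic)
    return np.stack([f1, f2], axis=-1)


def invert(target, lengths, artic_grid):
    """Grid-search inversion: one shared length, one articulation per frame.

    Returns (mean squared reconstruction error, length, articulation track).
    """
    best = None
    for length in lengths:
        candidates = synthesize(length, artic_grid)  # (grid, 2)
        # Squared error between every target frame and every candidate.
        err = ((target[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=-1)
        idx = err.argmin(axis=1)  # best articulation per frame
        loss = err[np.arange(len(target)), idx].mean()
        if best is None or loss < best[0]:
            best = (loss, float(length), artic_grid[idx])
    return best


# Simulated speaker with a tract shorter than the fixed default.
true_length = 15.0
true_artic = np.array([-0.5, 0.0, 0.5, 0.25])
target = synthesize(true_length, true_artic)

artic_grid = np.linspace(-1.0, 1.0, 81)
length_grid = np.arange(13.0, 19.01, 0.5)

# Joint estimation of anatomy and articulation.
joint_loss, joint_length, joint_artic = invert(target, length_grid, artic_grid)

# Baseline: anatomy fixed at 17.5 cm, only articulation is estimated.
fixed_loss, fixed_length, fixed_artic = invert(target, [17.5], artic_grid)

print("joint:", joint_length, joint_artic, joint_loss)
print("fixed:", fixed_length, fixed_artic, fixed_loss)
```

In this toy setting the fixed-length baseline can only reduce the acoustic error by distorting the articulation track, which mirrors the articulatory compensation the abstract describes, whereas the joint search is free to explain the speaker difference through the length parameter.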
