Joakim Gustafson, Éva Székely, Simon Alexandersson, J. Beskow
{"title":"Casual chatter or speaking up? Adjusting articulatory effort in generation of speech and animation for conversational characters","authors":"Joakim Gustafson, Éva Székely, Simon Alexandersson, J. Beskow","doi":"10.1109/FG57933.2023.10042520","DOIUrl":null,"url":null,"abstract":"Embodied conversational agents and social robots need to be able to generate spontaneous behavior in order to be believable in social interactions. We present a system that can generate spontaneous speech with supporting lip movements. The conversational TTS voice is trained on a podcast corpus that has been prosodically tagged (f0, speaking rate and energy) and transcribed (including tokens for breathing, fillers and laughter). We introduce a speech animation algorithm where articulatory effort can be adjusted. The speech animation is driven by time-stamped phonemes obtained from the internal alignment attention map of the TTS system, and we use prominence estimates from the synthesised speech waveform to modulate the lip- and jaw movements accordingly.","PeriodicalId":318766,"journal":{"name":"2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)","volume":"40 3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FG57933.2023.10042520","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Embodied conversational agents and social robots need to be able to generate spontaneous behavior in order to be believable in social interactions. We present a system that can generate spontaneous speech with supporting lip movements. The conversational TTS voice is trained on a podcast corpus that has been prosodically tagged (f0, speaking rate and energy) and transcribed (including tokens for breathing, fillers and laughter). We introduce a speech animation algorithm where articulatory effort can be adjusted. The speech animation is driven by time-stamped phonemes obtained from the internal alignment attention map of the TTS system, and we use prominence estimates from the synthesised speech waveform to modulate the lip and jaw movements accordingly.
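The core idea of the animation algorithm, as the abstract describes it, is to drive lip and jaw motion from time-stamped phonemes while scaling articulation by a global effort setting and by local prominence estimated from the synthesised waveform. A minimal sketch of that idea is shown below; all names (`Phoneme`, `BASE_JAW_OPEN`, the specific gain formula) are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of prominence-modulated articulation targets.
# The dataclass fields and the gain formula are assumptions for
# illustration; the paper's actual algorithm is not reproduced here.
from dataclasses import dataclass

@dataclass
class Phoneme:
    symbol: str
    start: float       # seconds, from the TTS attention-map alignment
    end: float
    prominence: float  # 0..1, estimated from the synthesised waveform

# Baseline jaw-opening amplitude per phoneme class (illustrative values):
# open vowels move the jaw most, bilabial closures not at all.
BASE_JAW_OPEN = {"AA": 1.0, "IY": 0.4, "M": 0.0, "S": 0.2}

def articulation_targets(phonemes, effort=1.0):
    """Return (time, jaw_open) keyframes. `effort` is a global
    articulatory-effort gain; prominence adds local emphasis so that
    stressed syllables get larger lip/jaw excursions."""
    frames = []
    for p in phonemes:
        base = BASE_JAW_OPEN.get(p.symbol, 0.5)
        # Scale the baseline by global effort and local prominence,
        # clamped so the jaw never opens beyond its maximum.
        amp = min(1.0, base * effort * (0.6 + 0.8 * p.prominence))
        frames.append(((p.start + p.end) / 2, amp))
    return frames

demo = [Phoneme("AA", 0.0, 0.12, 0.9), Phoneme("M", 0.12, 0.20, 0.1)]
print(articulation_targets(demo, effort=1.2))
```

Lowering `effort` toward zero would correspond to the "casual chatter" end of the paper's articulation range, while values above 1 correspond to "speaking up".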