{"title":"自动配音中使用语音内外韵律对齐的字幕合成","authors":"Giridhar Pamisetty, S. Kodukula","doi":"10.1109/NCC55593.2022.9806799","DOIUrl":null,"url":null,"abstract":"Automatic dubbing or machine dubbing is the process of replacing the speech in the source video with the desired language speech, which is synthesized using a text-to-speech synthesis (TTS) system. The synthesized speech should align with the events in the source video to have a realistic experience. Most of the existing prosodic alignment processes operate on the synthesized speech by controlling the speaking rate. In this paper, we propose subtitle synthesis, a unified approach for the prosodic alignment that operates at the feature level. Modifying the prosodic parameters at the feature level will not degrade the naturalness of the synthesized speech. We use both inter and intra utterance alignment in the prosodic alignment process. We should have control over the duration of the phonemes to perform alignment at the feature level to achieve synchronization between the synthesized and the source speech. So, we use the Prosody-TTS system to synthesize the speech, which has the provision to control the duration of phonemes and fundamental frequency (f0) during the synthesis. The subjective evaluation of the translated audiovisual content (lecture videos) resulted in a mean opinion score (MOS) of 4.104 that indicates the effectiveness of the proposed prosodic alignment process.","PeriodicalId":403870,"journal":{"name":"2022 National Conference on Communications (NCC)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Subtitle Synthesis using Inter and Intra utterance Prosodic Alignment for Automatic Dubbing\",\"authors\":\"Giridhar Pamisetty, S. 
Kodukula\",\"doi\":\"10.1109/NCC55593.2022.9806799\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automatic dubbing or machine dubbing is the process of replacing the speech in the source video with the desired language speech, which is synthesized using a text-to-speech synthesis (TTS) system. The synthesized speech should align with the events in the source video to have a realistic experience. Most of the existing prosodic alignment processes operate on the synthesized speech by controlling the speaking rate. In this paper, we propose subtitle synthesis, a unified approach for the prosodic alignment that operates at the feature level. Modifying the prosodic parameters at the feature level will not degrade the naturalness of the synthesized speech. We use both inter and intra utterance alignment in the prosodic alignment process. We should have control over the duration of the phonemes to perform alignment at the feature level to achieve synchronization between the synthesized and the source speech. So, we use the Prosody-TTS system to synthesize the speech, which has the provision to control the duration of phonemes and fundamental frequency (f0) during the synthesis. 
The subjective evaluation of the translated audiovisual content (lecture videos) resulted in a mean opinion score (MOS) of 4.104 that indicates the effectiveness of the proposed prosodic alignment process.\",\"PeriodicalId\":403870,\"journal\":{\"name\":\"2022 National Conference on Communications (NCC)\",\"volume\":\"66 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-05-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 National Conference on Communications (NCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/NCC55593.2022.9806799\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 National Conference on Communications (NCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NCC55593.2022.9806799","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Subtitle Synthesis using Inter and Intra utterance Prosodic Alignment for Automatic Dubbing
Automatic dubbing, or machine dubbing, is the process of replacing the speech in a source video with speech in the desired language, synthesized using a text-to-speech (TTS) system. The synthesized speech should align with the events in the source video to provide a realistic viewing experience. Most existing prosodic alignment methods operate on the already-synthesized speech by controlling the speaking rate. In this paper, we propose subtitle synthesis, a unified prosodic alignment approach that operates at the feature level. Modifying the prosodic parameters at the feature level does not degrade the naturalness of the synthesized speech. The prosodic alignment process uses both inter- and intra-utterance alignment. Performing alignment at the feature level requires control over phoneme durations in order to synchronize the synthesized speech with the source speech. We therefore use the Prosody-TTS system, which allows the duration of phonemes and the fundamental frequency (f0) to be controlled during synthesis. A subjective evaluation of the translated audiovisual content (lecture videos) yielded a mean opinion score (MOS) of 4.104, indicating the effectiveness of the proposed prosodic alignment process.
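The core idea of duration-based prosodic alignment can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not the paper's actual algorithm: it scales a TTS model's predicted phoneme durations so the synthesized utterance fills the source utterance's time slot, clamping the scaling factor so the speaking rate stays within a natural range (the clamp bounds are assumptions for illustration).

```python
def align_durations(phoneme_durations, target_duration,
                    min_scale=0.8, max_scale=1.25):
    """Scale predicted phoneme durations (in seconds) so their total
    approaches target_duration, the length of the source utterance.

    The scale factor is clamped to [min_scale, max_scale] so the
    resulting speaking rate does not sound unnaturally fast or slow.
    """
    total = sum(phoneme_durations)
    if total == 0:
        return list(phoneme_durations)
    scale = target_duration / total
    scale = max(min_scale, min(max_scale, scale))  # clamp to natural range
    return [d * scale for d in phoneme_durations]
```

For example, if the predicted durations sum to 0.6 s but the source utterance lasts 0.9 s, the requested scale of 1.5 is clamped to 1.25, stretching the durations as far as the naturalness bound allows; feature-level systems such as Prosody-TTS would then synthesize directly from the adjusted durations rather than time-stretching an already-rendered waveform.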