{"title":"参数化情感歌声合成","authors":"Younsung Park, Sungrack Yun, C. Yoo","doi":"10.1109/ICASSP.2010.5495137","DOIUrl":null,"url":null,"abstract":"This paper describes an algorithm to control the expressed emotion of a synthesized song. Based on the database of various melodies sung neutrally with restricted set of words, hidden semi-Markov models (HSMMs) of notes ranging from E3 to G5 are constructed for synthesizing singing voice. Three steps are taken in the synthesis: (1) Pitch and duration are determined according to the notes indicated by the musical score; (2) Features are sampled from appropriate HSMMs with the duration set to the maximum probability; (3) Singing voice is synthesized by the mel-log spectrum approximation (MLSA) filter using the sampled features as parameters of the filter. Emotion of a synthesized song is controlled by varying the duration and the vibrato parameters according to the Thayer's mood model. Perception test is performed to evaluate the synthesized song. The results show that the algorithm can control the expressed emotion of a singing voice given a neutral singing voice database.","PeriodicalId":293333,"journal":{"name":"2010 IEEE International Conference on Acoustics, Speech and Signal Processing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Parametric emotional singing voice synthesis\",\"authors\":\"Younsung Park, Sungrack Yun, C. Yoo\",\"doi\":\"10.1109/ICASSP.2010.5495137\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper describes an algorithm to control the expressed emotion of a synthesized song. Based on the database of various melodies sung neutrally with restricted set of words, hidden semi-Markov models (HSMMs) of notes ranging from E3 to G5 are constructed for synthesizing singing voice. Three steps are taken in the synthesis: (1) Pitch and duration are determined according to the notes indicated by the musical score; (2) Features are sampled from appropriate HSMMs with the duration set to the maximum probability; (3) Singing voice is synthesized by the mel-log spectrum approximation (MLSA) filter using the sampled features as parameters of the filter. Emotion of a synthesized song is controlled by varying the duration and the vibrato parameters according to the Thayer's mood model. Perception test is performed to evaluate the synthesized song. 
The results show that the algorithm can control the expressed emotion of a singing voice given a neutral singing voice database.\",\"PeriodicalId\":293333,\"journal\":{\"name\":\"2010 IEEE International Conference on Acoustics, Speech and Signal Processing\",\"volume\":\"41 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-03-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2010 IEEE International Conference on Acoustics, Speech and Signal Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICASSP.2010.5495137\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 IEEE International Conference on Acoustics, Speech and Signal Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICASSP.2010.5495137","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
This paper describes an algorithm for controlling the expressed emotion of a synthesized song. From a database of various melodies sung neutrally over a restricted set of words, hidden semi-Markov models (HSMMs) of the notes from E3 to G5 are constructed for singing voice synthesis. Synthesis proceeds in three steps: (1) pitch and duration are determined from the notes indicated by the musical score; (2) features are sampled from the appropriate HSMMs, with durations set to their maximum-probability values; (3) the singing voice is synthesized by a mel-log spectrum approximation (MLSA) filter, using the sampled features as its parameters. The emotion of a synthesized song is controlled by varying the duration and vibrato parameters according to Thayer's mood model. A perception test is performed to evaluate the synthesized songs. The results show that the algorithm can control the expressed emotion of a singing voice given only a neutral singing voice database.
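To make the emotion-control step concrete, below is a minimal Python sketch of how a point on Thayer's two-dimensional mood plane (valence, arousal) could be mapped to the duration and vibrato controls the paper varies. The abstract does not give the actual mapping or parameter values, so the function `thayer_to_controls`, the `EmotionControls` container, and all scaling constants here are illustrative assumptions, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class EmotionControls:
    duration_scale: float         # multiplier applied to every note length
    vibrato_rate_hz: float        # vibrato oscillation frequency
    vibrato_extent_cents: float   # peak F0 deviation from the note pitch


def thayer_to_controls(valence: float, arousal: float) -> EmotionControls:
    """Map a point in [-1, 1]^2 of Thayer's mood plane to synthesis controls.

    Assumed behavior: high arousal shortens durations (faster tempo) and
    speeds up/deepens vibrato; low arousal does the opposite. Valence only
    nudges vibrato depth here, as a stand-in for subtler effects.
    """
    duration_scale = 1.0 - 0.3 * arousal              # e.g. 0.7x at arousal = +1
    vibrato_rate_hz = 5.5 + 1.5 * arousal             # roughly a 4-7 Hz range
    vibrato_extent_cents = 50.0 + 30.0 * arousal + 10.0 * valence
    return EmotionControls(duration_scale, vibrato_rate_hz,
                           max(0.0, vibrato_extent_cents))


# Example: an "exuberant" target emotion (positive valence, high arousal)
print(thayer_to_controls(valence=0.8, arousal=0.9))
```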
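Continuing the sketch above (and reusing its `EmotionControls` and `thayer_to_controls`), the following hypothetical `apply_emotion` function shows the standard way such controls would act on a note's per-frame F0 contour: resample the contour to the scaled duration, then superimpose sinusoidal vibrato expressed in cents so its depth is independent of the note's pitch. This is a generic illustration, not the paper's implementation.

```python
import numpy as np


def apply_emotion(f0_hz: np.ndarray, frame_period_s: float,
                  ctrl: EmotionControls) -> np.ndarray:
    """Stretch one note's F0 contour and superimpose sinusoidal vibrato.

    f0_hz: per-frame fundamental frequency of a single voiced note.
    """
    # 1) Duration: resample the contour to the scaled number of frames.
    n_out = max(1, round(len(f0_hz) * ctrl.duration_scale))
    t_in = np.linspace(0.0, 1.0, len(f0_hz))
    t_out = np.linspace(0.0, 1.0, n_out)
    stretched = np.interp(t_out, t_in, f0_hz)

    # 2) Vibrato: modulate F0 in cents (pitch-independent depth).
    t = np.arange(n_out) * frame_period_s
    cents = ctrl.vibrato_extent_cents * np.sin(
        2.0 * np.pi * ctrl.vibrato_rate_hz * t)
    return stretched * 2.0 ** (cents / 1200.0)


# Usage: a 0.5 s note at A4 (440 Hz) with a 5 ms frame period.
note = np.full(100, 440.0)
emotional_f0 = apply_emotion(note, 0.005, thayer_to_controls(0.8, 0.9))
```

In an HSMM/MLSA pipeline like the one the paper describes, a contour modified this way would replace the neutral F0 trajectory before the MLSA filter renders the waveform; the spectral features would pass through unchanged.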