{"title":"语音合成中听者发声的注释意义","authors":"Sathish Pammi, M. Schröder","doi":"10.1109/ACII.2009.5349568","DOIUrl":null,"url":null,"abstract":"Generation of listener vocalizations is one of the major objectives of emotionally colored conversational speech synthesis. Success in this endeavor depends on the answers to three questions: What kinds of meaning are expressed through listener vocalizations? What form is suitable for a given meaning? And, in what context should which listener vocalizations be produced? In this paper, we address the first of these questions. We present a method to record natural and expressive listener vocalizations for synthesis, and describe our approach to identify a suitable categorical description of the meaning conveyed in the vocalizations. In our data, one actor produces a total of 967 listener vocalizations, in his natural speaking style and three acted emotion-specific personalities. In an open categorization scheme, we find that eleven categories occur on at least 5% of the vocalizations, and that most vocalizations are better described by two or three categories rather than a single one. Furthermore, an annotation of meaning reference, according to Buhler's Organon model, allows us to make interesting observations regarding the listener's own state, his stance towards the interlocutor, and his attitude towards the topic of the conversation.","PeriodicalId":330737,"journal":{"name":"2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"24","resultStr":"{\"title\":\"Annotating meaning of listener vocalizations for speech synthesis\",\"authors\":\"Sathish Pammi, M. 
Schröder\",\"doi\":\"10.1109/ACII.2009.5349568\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Generation of listener vocalizations is one of the major objectives of emotionally colored conversational speech synthesis. Success in this endeavor depends on the answers to three questions: What kinds of meaning are expressed through listener vocalizations? What form is suitable for a given meaning? And, in what context should which listener vocalizations be produced? In this paper, we address the first of these questions. We present a method to record natural and expressive listener vocalizations for synthesis, and describe our approach to identify a suitable categorical description of the meaning conveyed in the vocalizations. In our data, one actor produces a total of 967 listener vocalizations, in his natural speaking style and three acted emotion-specific personalities. In an open categorization scheme, we find that eleven categories occur on at least 5% of the vocalizations, and that most vocalizations are better described by two or three categories rather than a single one. 
Furthermore, an annotation of meaning reference, according to Buhler's Organon model, allows us to make interesting observations regarding the listener's own state, his stance towards the interlocutor, and his attitude towards the topic of the conversation.\",\"PeriodicalId\":330737,\"journal\":{\"name\":\"2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops\",\"volume\":\"11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-12-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"24\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ACII.2009.5349568\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ACII.2009.5349568","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Annotating meaning of listener vocalizations for speech synthesis
Generation of listener vocalizations is one of the major objectives of emotionally colored conversational speech synthesis. Success in this endeavor depends on the answers to three questions: What kinds of meaning are expressed through listener vocalizations? What form is suitable for a given meaning? And in what context should which listener vocalizations be produced? In this paper, we address the first of these questions. We present a method for recording natural and expressive listener vocalizations for synthesis, and describe our approach to identifying a suitable categorical description of the meaning conveyed by the vocalizations. In our data, one actor produced a total of 967 listener vocalizations, in his natural speaking style and in three acted, emotion-specific personalities. Using an open categorization scheme, we find that eleven categories occur in at least 5% of the vocalizations, and that most vocalizations are better described by two or three categories than by a single one. Furthermore, an annotation of meaning reference, according to Bühler's Organon model, allows us to make interesting observations regarding the listener's own state, his stance towards the interlocutor, and his attitude towards the topic of the conversation.