Extracting emotions from speech using a bag-of-visual-words approach

E. Spyrou, Theodoros Giannakopoulos, Dimitris Sgouropoulos, Michalis Papakostas

2017 12th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP), July 2017. DOI: 10.1109/SMAP.2017.8022672
Recognition of human emotions may be crucial in certain applications, such as human-computer interaction, monitoring of the elderly, or understanding the affective state of learners during a course. To this end, and depending on the application and the environment, one may use physiological parameters (e.g., heart rate, brain activity), which are typically obtrusive, or analyze other modalities that can be extracted by simply observing a human, such as visual cues (e.g., facial expressions, gestures, skeletal motion) or audio (e.g., speech). In many applications the only available modality is the latter, i.e., the human voice. In this work we aim to analyze a speaker's emotions by relying only on paralinguistic information extracted from her/his voice, thus discarding the linguistic aspect of speech (i.e., the spoken words). To this end, we propose a novel emotion classification approach inspired by computer vision tasks. We use a spectrogram, which is a visual representation of the spectrum of an audio segment. We then extract features, encode them using a visual vocabulary, and represent the spectrogram as a "bag of visual words." This representation is used for classifying an audio segment into an emotion class. We evaluate our approach on three datasets that contain speech in different languages and compare it to baseline methods.
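To make the pipeline concrete, the following is a minimal sketch in Python of the bag-of-visual-words idea the abstract describes: compute a spectrogram, extract local features from it, quantize them against a learned visual vocabulary, and classify the resulting histogram. The abstract does not specify which local features, vocabulary size, or classifier the authors use, so this sketch assumes log-spectrogram patches, a k-means vocabulary of 64 words, and an SVM; the helper names spectrogram_patches and bovw_histogram, the patch size, and all parameter values are hypothetical.

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def spectrogram_patches(signal, fs, patch=8):
    """Compute a log-magnitude spectrogram and slice it into square
    patches, each flattened into a local feature vector."""
    _, _, sxx = spectrogram(signal, fs=fs, nperseg=256, noverlap=128)
    log_sxx = np.log(sxx + 1e-10)
    feats = []
    for i in range(0, log_sxx.shape[0] - patch + 1, patch):
        for j in range(0, log_sxx.shape[1] - patch + 1, patch):
            feats.append(log_sxx[i:i + patch, j:j + patch].ravel())
    return np.array(feats)

def bovw_histogram(feats, vocab):
    """Assign each local feature to its nearest visual word and pool the
    assignments into a normalized bag-of-visual-words histogram."""
    words = vocab.predict(feats)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy usage on synthetic signals (stand-ins for labeled speech segments).
rng = np.random.default_rng(0)
fs = 16000
segments = [rng.standard_normal(fs) for _ in range(20)]  # 1-second "utterances"
labels = rng.integers(0, 2, size=20)                     # two emotion classes

# Build the visual vocabulary by clustering local features from all segments.
all_feats = np.vstack([spectrogram_patches(s, fs) for s in segments])
vocab = KMeans(n_clusters=64, n_init=4, random_state=0).fit(all_feats)

# Represent each segment as a BoVW histogram and train a classifier on it.
X = np.array([bovw_histogram(spectrogram_patches(s, fs), vocab) for s in segments])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:3]))
```

The design mirrors the image-domain recipe: the spectrogram plays the role of the image, patches play the role of keypoint descriptors, and the fixed-length histogram makes variable-length audio segments comparable under any standard classifier.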