Extracting emotions from speech using a bag-of-visual-words approach

E. Spyrou, Theodoros Giannakopoulos, Dimitris Sgouropoulos, Michalis Papakostas

2017 12th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP), July 2017
DOI: 10.1109/SMAP.2017.8022672
Citations: 7

Abstract

Recognition of human emotions may be crucial in certain applications, e.g., human-computer interaction, monitoring of the elderly, or understanding the affective state of learners during a course. To this end, and depending on the application and the environment, one may use physiological parameters (e.g., heart rate, brain activity), which are typically obtrusive, or analyze other modalities that can be extracted by simply observing a person, such as visual cues (e.g., facial expressions, gestures, skeletal motion) or audio (e.g., speech). In many applications the only available modality is the latter, i.e., the human voice. In this work we aim to analyze a speaker's emotions by relying only on paralinguistic information extracted from her/his voice, thus discarding the linguistic aspect of speech (i.e., the spoken words). To this end, we propose a novel emotion classification approach inspired by computer vision tasks. We use a spectrogram, which is a visual representation of the spectrum of an audio segment. We then extract features, code them using a visual vocabulary, and represent a spectrogram as a "bag of visual words." This representation is used for classifying an audio segment into an emotion class. We evaluate our approach on three datasets that contain speech from different languages and compare it to baseline methods.
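To make the described pipeline concrete, the sketch below shows one plausible spectrogram-based bag-of-visual-words classifier. It is a minimal illustration under stated assumptions, not the authors' implementation: the dense-patch descriptors, the vocabulary size k=256, the patch and stride values, and the SVM classifier are all choices made here for the example.

```python
# A minimal bag-of-visual-words (BoVW) sketch over spectrograms.
# NOT the paper's implementation: patch descriptors, k=256, and the SVM
# are illustrative assumptions.
import numpy as np
from scipy.signal import spectrogram
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def audio_to_spectrogram(signal, fs=16000):
    """Compute a log-magnitude spectrogram (the 'image' the BoVW step sees)."""
    _, _, sxx = spectrogram(signal, fs=fs, nperseg=512, noverlap=256)
    return np.log(sxx + 1e-10)

def dense_patches(spec, patch=8, stride=4):
    """Extract overlapping spectrogram patches as local descriptors."""
    feats = []
    for i in range(0, spec.shape[0] - patch + 1, stride):
        for j in range(0, spec.shape[1] - patch + 1, stride):
            feats.append(spec[i:i + patch, j:j + patch].ravel())
    return np.array(feats)

def build_vocabulary(all_patches, k=256):
    """Cluster descriptors; the k centroids act as the 'visual words'."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_patches)

def bovw_histogram(spec, vocab):
    """Represent one spectrogram as a normalized histogram of visual words."""
    words = vocab.predict(dense_patches(spec))
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train(specs, labels, k=256):
    """specs: list of spectrograms; labels: their emotion classes."""
    vocab = build_vocabulary(np.vstack([dense_patches(s) for s in specs]), k)
    X = np.array([bovw_histogram(s, vocab) for s in specs])
    clf = SVC(kernel="rbf").fit(X, labels)
    return vocab, clf

def predict_emotion(signal, vocab, clf, fs=16000):
    """Classify a raw audio segment into an emotion class."""
    h = bovw_histogram(audio_to_spectrogram(signal, fs), vocab)
    return clf.predict([h])[0]
```

The key design idea, shared with image-based BoVW models, is that the k-means centroids form a visual vocabulary: each local spectrogram patch is assigned to its nearest centroid, and an audio segment is summarized by how often each "word" occurs, discarding where in time or frequency it appeared.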