Emotion Recognition in the Wild: Incorporating Voice and Lip Activity in Multimodal Decision-Level Fusion

F. Ringeval, S. Amiriparian, F. Eyben, K. Scherer, Björn Schuller

Proceedings of the 16th International Conference on Multimodal Interaction (ICMI 2014), 12 November 2014. DOI: 10.1145/2663204.2666271
In this paper, we investigate the relevance of using voice and lip activity to improve the performance of audiovisual emotion recognition in unconstrained settings, as part of the 2014 Emotion Recognition in the Wild Challenge (EmotiW14). Indeed, the dataset provided by the organisers contains movie excerpts with highly challenging variability in terms of audiovisual content; e.g., the speech and/or the face of the subject expressing the emotion can be absent from the data. We therefore propose to tackle this issue by incorporating both voice and lip activity as additional features in a decision-level fusion. Results obtained on the blind test set show that the decision-level fusion improves on the best mono-modal approach, and that adding both voice and lip activity to the feature set leads to the best performance (UAR = 35.27%), an absolute improvement of 5.36% over the baseline.
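The paper itself does not publish code, but the two ideas in the abstract, decision-level fusion with activity indicators as extra features and the UAR metric, can be illustrated with a minimal sketch. In the Python snippet below, all data and names (audio_probs, video_probs, vad, lip) are hypothetical stand-ins, and the second-stage logistic regression is an assumption rather than the authors' actual fusion model; the point is the structure: per-modality class posteriors are concatenated with binary voice- and lip-activity indicators, a fusion classifier is trained on top, and performance is scored as unweighted average recall (UAR), i.e. macro-averaged recall over the emotion classes.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
n, n_classes = 200, 7                      # EmotiW14 uses 7 emotion classes

# Stand-in mono-modal posteriors; in practice these would come from the
# audio and video emotion classifiers.
audio_probs = rng.dirichlet(np.ones(n_classes), size=n)
video_probs = rng.dirichlet(np.ones(n_classes), size=n)
vad = rng.integers(0, 2, size=(n, 1))      # 1 = voice detected in the clip
lip = rng.integers(0, 2, size=(n, 1))      # 1 = lip activity detected
y = rng.integers(0, n_classes, size=n)     # stand-in emotion labels

# Decision-level fusion: stack the per-modality decisions together with the
# activity indicators and train a second-stage classifier on top.
fusion_features = np.hstack([audio_probs, video_probs, vad, lip])
fusion_clf = LogisticRegression(max_iter=1000).fit(fusion_features, y)
pred = fusion_clf.predict(fusion_features)

# UAR = recall averaged over classes with equal weight per class,
# the metric reported in the paper (35.27% on the blind test set).
uar = recall_score(y, pred, average="macro")
print(f"UAR = {uar:.2%}")

In a real pipeline the second-stage classifier would be fit on predictions for a held-out development set rather than on the training data itself, so that the fusion weights do not simply overfit the mono-modal models' training-set behaviour.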