Fusing active orientation models and mid-term audio features for automatic depression estimation
C. Smailis, N. Sarafianos, Theodoros Giannakopoulos, S. Perantonis
Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments, 2016-06-29. DOI: 10.1145/2910674.2935856
Citations: 3
Abstract
In this paper, we predict a human's depression level on the BDI-II scale using facial and voice features. Active orientation models (AOMs) and several voice features were extracted from the video and audio modalities, respectively. Long-term and mid-term features were computed, and fusion was performed in the feature space. Videos from the Depression Recognition Sub-Challenge of the 2014 Audio-Visual Emotion Challenge and Workshop (AVEC 2014) were used, and support vector regression models were trained to predict the depression level. We demonstrate that fusing AOMs with audio features leads to better performance than either modality alone. The regression results indicate the robustness of the proposed technique under different settings, as well as an RMSE improvement over the AVEC 2014 video baseline.
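To make the pipeline concrete, below is a minimal sketch of the feature-level fusion and regression stage described in the abstract. The feature extraction itself (AOM fitting on video frames, mid-term audio statistics) is out of scope here: `video_feats`, `audio_feats`, and all dimensionalities and hyperparameters are illustrative placeholders, not values from the paper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_recordings = 50

# Hypothetical per-recording descriptors standing in for the real features.
video_feats = rng.normal(size=(n_recordings, 120))  # AOM-based facial features
audio_feats = rng.normal(size=(n_recordings, 68))   # mid-term audio statistics
bdi_scores = rng.uniform(0, 63, size=n_recordings)  # BDI-II targets (0-63 range)

# Feature-level (early) fusion: concatenate the two modalities, then scale.
fused = np.hstack([video_feats, audio_feats])
fused = StandardScaler().fit_transform(fused)

# Support vector regression on the fused representation.
model = SVR(kernel="rbf", C=1.0, epsilon=0.5).fit(fused, bdi_scores)
pred = model.predict(fused)

# RMSE, the metric used to compare against the AVEC 2014 video baseline.
rmse = np.sqrt(np.mean((pred - bdi_scores) ** 2))
print(f"training RMSE: {rmse:.2f}")
```

In practice the evaluation would follow the AVEC 2014 train/development/test partitions rather than scoring on the training data as this toy example does.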