Multi-Modal Multi-Task Deep Learning For Speaker And Emotion Recognition Of TV-Series Data
Sashi Novitasari, Quoc Truong Do, S. Sakti, D. Lestari, Satoshi Nakamura
2018 Oriental COCOSDA - International Conference on Speech Database and Assessments, 2018-05-07. DOI: 10.1109/ICSDA.2018.8693020
Since paralinguistic aspects must be considered to understand speech, we construct a deep learning framework that utilizes multi-modal features to simultaneously recognize both speakers and emotions. There are three kinds of feature modalities: acoustic, lexical, and facial. To fuse the features from multiple modalities, we experimented with three methods: majority voting, concatenation, and hierarchical fusion. Recognition was performed on a TV-series dataset that simulates actual conversations.
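To illustrate the general idea, the sketch below shows one of the fusion strategies mentioned in the abstract (concatenation) combined with multi-task prediction of speaker and emotion. This is not the authors' implementation; the feature dimensions, layer sizes, class counts, and the use of PyTorch are all assumptions for illustration only.

```python
# Minimal sketch (not the paper's code): concatenation fusion of acoustic,
# lexical, and facial features into a shared encoder with two task heads
# (speaker ID and emotion). All dimensions and sizes are assumed values.
import torch
import torch.nn as nn

class MultiModalMultiTaskNet(nn.Module):
    def __init__(self, acoustic_dim=40, lexical_dim=300, facial_dim=128,
                 hidden_dim=256, num_speakers=10, num_emotions=6):
        super().__init__()
        fused_dim = acoustic_dim + lexical_dim + facial_dim
        # Shared encoder over the concatenated multi-modal feature vector
        self.shared = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Task-specific heads trained jointly (multi-task learning)
        self.speaker_head = nn.Linear(hidden_dim, num_speakers)
        self.emotion_head = nn.Linear(hidden_dim, num_emotions)

    def forward(self, acoustic, lexical, facial):
        # Concatenation fusion: stack the per-modality feature vectors
        fused = torch.cat([acoustic, lexical, facial], dim=-1)
        h = self.shared(fused)
        return self.speaker_head(h), self.emotion_head(h)

# Joint training minimizes the sum of the two cross-entropy losses.
model = MultiModalMultiTaskNet()
acoustic = torch.randn(8, 40)
lexical = torch.randn(8, 300)
facial = torch.randn(8, 128)
speaker_logits, emotion_logits = model(acoustic, lexical, facial)
loss = (nn.CrossEntropyLoss()(speaker_logits, torch.randint(0, 10, (8,)))
        + nn.CrossEntropyLoss()(emotion_logits, torch.randint(0, 6, (8,))))
loss.backward()
```

Majority voting would instead run one classifier per modality and vote over their predictions, while hierarchical fusion combines modalities in stages rather than all at once; the concatenation variant above is simply the easiest to show compactly.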