Multi-Modal Multi-Task Deep Learning For Speaker And Emotion Recognition Of TV-Series Data

2018 Oriental COCOSDA - International Conference on Speech Database and Assessments. Pub Date: 2018-05-07. DOI: 10.1109/ICSDA.2018.8693020

Sashi Novitasari, Quoc Truong Do, S. Sakti, D. Lestari, Satoshi Nakamura

Citations: 3

Abstract

Because paralinguistic aspects must be considered to understand speech, we construct a deep learning framework that uses multi-modal features to recognize speakers and emotions simultaneously. The framework draws on three feature modalities: acoustic, lexical, and facial. To fuse the features from these modalities, we experimented with three methods: majority voting, concatenation, and hierarchical fusion. Recognition was performed on a TV-series dataset that simulates actual conversations.
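As a rough illustration of two of the fusion methods named in the abstract, the following minimal Python sketch shows majority voting over per-modality predictions and concatenation of per-modality feature vectors. It is not the authors' implementation. The function names, the dictionary layout, the modality ordering, and the tie-breaking rule are all assumptions made for the example. Hierarchical fusion is left out because the abstract does not describe how its stages are arranged.

```python
from collections import Counter

# Hypothetical modality order, used for tie-breaking and for concatenation.
MODALITIES = ("acoustic", "lexical", "facial")


def majority_vote(predictions, tie_break_order=MODALITIES):
    """Decision-level fusion: each modality votes for one label.

    predictions maps a modality name to the label predicted from it.
    Ties are resolved by the first modality in tie_break_order whose
    label is among the tied winners (an assumption, not from the paper).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions.values())
    top = max(counts.values())
    winners = {label for label, count in counts.items() if count == top}
    if len(winners) == 1:
        return next(iter(winners))
    for modality in tie_break_order:
        if predictions.get(modality) in winners:
            return predictions[modality]
    raise ValueError("tie could not be resolved with the given order")


def concatenate_features(features, order=MODALITIES):
    """Feature-level fusion: join per-modality vectors into one vector."""
    fused = []
    for modality in order:
        fused.extend(features[modality])
    return fused
```

With majority voting, each modality would have its own classifier and only the predicted labels are combined. With concatenation, a single classifier would receive the joined vector as input.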

