A deep learning based framework for converting sign language to emotional speech

2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Pub Date: 2018-11-01. DOI: 10.23919/APSIPA.2018.8659571

Nan Song, Hongwu Yang, Pengpeng Zhi

Citations: 1

Abstract

This paper proposes a deep learning framework for converting sign language to emotional speech. We first adopt a deep neural network (DNN) model to extract features from sign language and facial expressions. We then train two support vector machines (SVMs) to classify the sign language and the facial expressions, recognizing the text of the sign language and the emotion tags of the facial expressions. We also train a set of DNN-based emotional speech acoustic models by speaker adaptive training on a multi-speaker emotional speech corpus. Finally, we use the emotion tags to select the DNN-based emotional speech acoustic model and synthesize emotional speech from the text recognized from the sign language. Objective tests show that the recognition rate for static sign language is 90.7%. The facial expression recognition rate reaches 94.6% on the extended Cohn-Kanade database (CK+) and 80.3% on the Japanese Female Facial Expression (JAFFE) database. Subjective evaluation shows that the synthesized emotional speech achieves an emotional mean opinion score of 4.2. The pleasure-arousal-dominance (PAD) three-dimensional emotion model is used to evaluate the PAD values of both the facial expressions and the synthesized emotional speech. The results show that the PAD values of the facial expressions are close to those of the synthesized emotional speech, which means that the synthesized emotional speech can express the emotions of the facial expressions.
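The last two stages described in the abstract, choosing an acoustic model from the recognized emotion tag and comparing PAD values, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The emotion tags, the model identifiers, the `neutral` fallback, and the use of Euclidean distance as the measure of how "close" two PAD values are, are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical container for the two recognition outputs."""
    text: str     # text recognized from the sign language (sign SVM output)
    emotion: str  # emotion tag recognized from the face (face SVM output)


# Hypothetical registry: one acoustic-model identifier per emotion tag.
# The tag names and identifiers are placeholders, not taken from the paper.
ACOUSTIC_MODELS = {
    "happy": "dnn_acoustic_happy",
    "sad": "dnn_acoustic_sad",
    "neutral": "dnn_acoustic_neutral",
}


def select_acoustic_model(utt, models=ACOUSTIC_MODELS, fallback="neutral"):
    """Return the acoustic model whose tag matches the recognized emotion.

    Unknown tags fall back to the `fallback` model (an assumption).
    """
    return models.get(utt.emotion, models[fallback])


def pad_distance(a, b):
    """Euclidean distance between two (pleasure, arousal, dominance) triples.

    Euclidean distance is an assumed closeness measure for this sketch.
    """
    if len(a) != 3 or len(b) != 3:
        raise ValueError("PAD vectors must have exactly three components")
    return math.dist(a, b)
```

Under these assumptions, `select_acoustic_model(Utterance("hello", "happy"))` returns the identifier registered for the `happy` tag, and a small `pad_distance` between the PAD triple rated for a facial expression and the one rated for the synthesized speech would correspond to the "close" PAD values reported in the abstract.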
