Understanding speaking styles of internet speech data with LSTM and low-resource training
Xixin Wu, Zhiyong Wu, Yishuang Ning, Jia Jia, Lianhong Cai, H. Meng
2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 815-820
Published: 2015-09-21
DOI: 10.1109/ACII.2015.7344667
Citations: 0
Abstract
Speech is widely used to express emotion, intention, desire, etc. in social network communication, yielding an abundance of internet speech data with different speaking styles. Such data provides a good resource for social multimedia research. However, since different styles are mixed together in internet speech data, how to classify such data remains a challenging problem. In previous work, utterance-level statistics of acoustic features were used as features for classifying speaking styles, ignoring local context information. The long short-term memory (LSTM) recurrent neural network (RNN) has achieved exciting success in many research areas, such as speech recognition. It is able to retain context information over long time durations, which is important for characterizing speaking styles. However, training an LSTM requires a large amount of labeled data, which is quite difficult to obtain in the scenario of internet speech data classification. On the other hand, publicly available data exist for related tasks (such as speech emotion recognition), which offers a new possibility for exploiting LSTM in this low-resource task. We adopt a retraining strategy to train the LSTM to recognize speaking styles: the network is trained on the emotion and speaking style datasets sequentially, without resetting its weights between stages. Experimental results demonstrate that retraining improves both the training speed and the classification accuracy of the network in speaking style classification.
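The retraining strategy described in the abstract — pretrain on the emotion dataset, then continue training on the speaking-style dataset while carrying over the learned weights — can be sketched as follows. The paper's abstract does not specify the architecture, feature dimensions, or class counts, so everything below (39-dimensional acoustic features, a single 64-unit LSTM layer, 6 emotion classes, 4 style classes, and random tensors standing in for the real datasets) is an illustrative assumption, written here in PyTorch:

```python
import torch
import torch.nn as nn

class UtteranceLSTM(nn.Module):
    """Utterance-level classifier: an LSTM runs over frame-level acoustic
    features and its final hidden state feeds a task-specific output layer."""
    def __init__(self, feat_dim=39, hidden=64, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):             # x: (batch, frames, feat_dim)
        _, (h, _) = self.lstm(x)      # h: (num_layers, batch, hidden)
        return self.out(h[-1])        # utterance-level class logits

def train(model, data, labels, epochs=3, lr=1e-3):
    """A few full-batch gradient steps; a stand-in for a real training loop."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(data), labels)
        loss.backward()
        opt.step()

torch.manual_seed(0)

# Stage 1: pretrain on the (larger, publicly available) emotion dataset.
# Random tensors stand in for real features/labels here.
emo_x, emo_y = torch.randn(32, 100, 39), torch.randint(0, 6, (32,))
model = UtteranceLSTM(n_classes=6)    # 6 emotion classes (assumed)
train(model, emo_x, emo_y)

# Stage 2: retrain on the low-resource speaking-style dataset.
# Only the output layer is replaced for the new label set; the LSTM
# weights are carried over, NOT reset.
style_x, style_y = torch.randn(16, 100, 39), torch.randint(0, 4, (16,))
model.out = nn.Linear(64, 4)          # 4 speaking-style classes (assumed)
train(model, style_x, style_y)
```

The key point of the strategy is in stage 2: reusing the LSTM's recurrent weights lets the style classifier start from context-modeling features already shaped by the emotion data, which is what the abstract credits for the faster training and higher accuracy.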