基于卷积神经网络语音学习表征的语音速率估计

2020 International Conference on Signal Processing and Communications (SPCOM) Pub Date : 2020-07-01 DOI:10.1109/SPCOM50965.2020.9179502

Renuka Mannem, H. Jyothi, Aravind Illa, P. Ghosh

{"title":"基于卷积神经网络语音学习表征的语音速率估计","authors":"Renuka Mannem, H. Jyothi, Aravind Illa, P. Ghosh","doi":"10.1109/SPCOM50965.2020.9179502","DOIUrl":null,"url":null,"abstract":"With advancement in machine learning techniques, several speech related applications deploy end-to-end models to learn relevant features from the raw speech signal. In this work, we focus on the speech rate estimation task using an end-to-end model to learn representation from raw speech in a data driven manner. We propose an end-to-end model that comprises of 1-d convolutional layer to extract representations from raw speech and a convolutional dense neural network (CDNN) to predict speech rate from these representations. The primary aim of the work is to understand the nature of representations learned by end-to-end model for the speech rate estimation task. Experiments are performed using TIMIT corpus, in seen and unseen subject conditions. Experimental results reveal that, the frequency response of the learned 1-d CNN filters are low-pass in nature, and center frequencies of majority of the filters lie below 1000Hz. While comparing the performance of the proposed end-to-end system with the baseline MFCC based approach, we find that the performance of the learned features with CNN are on par with MFCC.","PeriodicalId":208527,"journal":{"name":"2020 International Conference on Signal Processing and Communications (SPCOM)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Speech rate estimation using representations learned from speech with convolutional neural network\",\"authors\":\"Renuka Mannem, H. Jyothi, Aravind Illa, P. Ghosh\",\"doi\":\"10.1109/SPCOM50965.2020.9179502\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With advancement in machine learning techniques, several speech related applications deploy end-to-end models to learn relevant features from the raw speech signal. In this work, we focus on the speech rate estimation task using an end-to-end model to learn representation from raw speech in a data driven manner. We propose an end-to-end model that comprises of 1-d convolutional layer to extract representations from raw speech and a convolutional dense neural network (CDNN) to predict speech rate from these representations. The primary aim of the work is to understand the nature of representations learned by end-to-end model for the speech rate estimation task. Experiments are performed using TIMIT corpus, in seen and unseen subject conditions. Experimental results reveal that, the frequency response of the learned 1-d CNN filters are low-pass in nature, and center frequencies of majority of the filters lie below 1000Hz. While comparing the performance of the proposed end-to-end system with the baseline MFCC based approach, we find that the performance of the learned features with CNN are on par with MFCC.\",\"PeriodicalId\":208527,\"journal\":{\"name\":\"2020 International Conference on Signal Processing and Communications (SPCOM)\",\"volume\":\"84 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 International Conference on Signal Processing and Communications (SPCOM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SPCOM50965.2020.9179502\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 International Conference on Signal Processing and Communications (SPCOM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPCOM50965.2020.9179502","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

随着机器学习技术的进步，一些语音相关应用部署端到端模型来从原始语音信号中学习相关特征。在这项工作中，我们专注于语音速率估计任务，使用端到端模型以数据驱动的方式从原始语音中学习表示。我们提出了一个端到端模型，该模型由一维卷积层和卷积密集神经网络(CDNN)组成，前者用于从原始语音中提取表征，后者用于从这些表征中预测语音速率。这项工作的主要目的是了解端到端模型在语音速率估计任务中学习的表征的本质。实验使用TIMIT语料库，在可见和未见的受试者条件下进行。实验结果表明，学习到的一维CNN滤波器的频率响应本质上是低通的，大多数滤波器的中心频率在1000Hz以下。在将提出的端到端系统的性能与基于基线MFCC的方法进行比较时，我们发现使用CNN学习到的特征的性能与MFCC相当。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Speech rate estimation using representations learned from speech with convolutional neural network

With advancement in machine learning techniques, several speech related applications deploy end-to-end models to learn relevant features from the raw speech signal. In this work, we focus on the speech rate estimation task using an end-to-end model to learn representation from raw speech in a data driven manner. We propose an end-to-end model that comprises of 1-d convolutional layer to extract representations from raw speech and a convolutional dense neural network (CDNN) to predict speech rate from these representations. The primary aim of the work is to understand the nature of representations learned by end-to-end model for the speech rate estimation task. Experiments are performed using TIMIT corpus, in seen and unseen subject conditions. Experimental results reveal that, the frequency response of the learned 1-d CNN filters are low-pass in nature, and center frequencies of majority of the filters lie below 1000Hz. While comparing the performance of the proposed end-to-end system with the baseline MFCC based approach, we find that the performance of the learned features with CNN are on par with MFCC.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2020 International Conference on Signal Processing and Communications (SPCOM)

自引率

0.00%

发文量