Multi-resolution spectral input for convolutional neural network-based speech recognition
L. Tóth
2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), 2017-07-06
DOI: 10.1109/SPED.2017.7990430 (https://doi.org/10.1109/SPED.2017.7990430)
Citation count: 5
Abstract
The convolutional deep neural network component frequently used in current speech recognizers is trained on a context of consecutive spectral feature vectors. Here, we investigate whether using a multi-resolution spectrum as input lets us extend the time span of this input while also reducing the number of spectral features. In the proposed multi-resolution scheme, the network processes the near neighbors of the current frame at the standard resolution and applies a gradually coarser resolution to more distant frames. With this solution, we extended the input of our network to a time context of 45 frames without increasing the number of input features, and we achieved a relative error rate reduction of 3–4% compared to the conventional high-resolution representation. We report a phone error rate of 17.0% on the TIMIT core test set, which is competitive with the best scores published on this data set.
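The multi-resolution idea described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the block layout, so everything specific here is an assumption: the `SIDE_LAYOUT` widths, the use of plain averaging over time as the "coarser resolution", the edge padding at utterance boundaries, and the function names `side_blocks` and `multires_context` are all hypothetical. The layout is chosen only so that the context spans 45 frames (22 on each side of the frame being classified).

```python
import numpy as np

# Hypothetical layout for one side of the centre frame: (block_width, block_count).
# Offsets 1-4 keep full resolution; farther offsets are averaged in wider blocks.
SIDE_LAYOUT = [(1, 4), (2, 3), (3, 2), (6, 1)]  # covers 4 + 6 + 6 + 6 = 22 frames


def side_blocks(layout):
    """Return inclusive (start_offset, end_offset) pairs for one side."""
    blocks, start = [], 1
    for width, count in layout:
        for _ in range(count):
            blocks.append((start, start + width - 1))
            start += width
    return blocks


def multires_context(spec, t, layout=SIDE_LAYOUT):
    """Build a multi-resolution context around frame t.

    spec: array of shape (num_frames, num_bands).
    Returns an array of shape (num_vectors, num_bands), ordered from the
    most distant past block to the most distant future block.
    """
    blocks = side_blocks(layout)
    span = blocks[-1][1]
    # Repeat edge frames so every offset is defined near utterance boundaries
    # (an assumption; other padding schemes are possible).
    padded = np.pad(spec, ((span, span), (0, 0)), mode="edge")
    c = t + span  # index of frame t in the padded array
    past = [padded[c - e:c - s + 1].mean(axis=0) for s, e in reversed(blocks)]
    future = [padded[c + s:c + e + 1].mean(axis=0) for s, e in blocks]
    return np.stack(past + [padded[c]] + future)


# Usage: a random "spectrogram" with 100 frames and 40 spectral bands.
spec = np.random.default_rng(0).normal(size=(100, 40))
ctx = multires_context(spec, t=50)
print(ctx.shape)
```

Under this assumed layout, 21 vectors summarize a 45-frame window, whereas a full-resolution context of the same span would need 45. The sketch pools along time only; whether the paper also coarsens the frequency axis is not stated in the abstract.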