A discriminatively trained Hough Transform for frame-level phoneme recognition

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Pub Date : 2014-07-14 DOI:10.1109/ICASSP.2014.6854053

J. Dennis, T. H. Dat, Haizhou Li, Chng Eng Siong

{"title":"A discriminatively trained Hough Transform for frame-level phoneme recognition","authors":"J. Dennis, T. H. Dat, Haizhou Li, Chng Eng Siong","doi":"10.1109/ICASSP.2014.6854053","DOIUrl":null,"url":null,"abstract":"Despite recent advances in the use of Artificial Neural Network (ANN) architectures for automatic speech recognition (ASR), relatively little attention has been given to using feature inputs beyond MFCCs in such systems. In this paper, we propose an alternative to conventional MFCC or filterbank features, using an approach based on the Generalised Hough Transform (GHT). The GHT is a common approach used in the field of image processing for the task of object detection, where the idea is to learn the spatial distribution of a codebook of feature information relative to the location of the target class. During recognition, a simple weighted summation of the codebook activations is commonly used to detect the presence of the target classes. Here we propose to learn the weighting discriminatively in an ANN, where the aim is to optimise the static phone classification error at the output of the network. As such an ANN is common to hybrid ASR architectures, the output activations from the GHT can be considered as a novel feature for ASR. Experimental results on the TIMIT phoneme recognition task demonstrate the state-of-the-art performance of the approach.","PeriodicalId":6545,"journal":{"name":"2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"6 12 1","pages":"2514-2518"},"PeriodicalIF":0.0000,"publicationDate":"2014-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICASSP.2014.6854053","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

Abstract

Despite recent advances in the use of Artificial Neural Network (ANN) architectures for automatic speech recognition (ASR), relatively little attention has been given to using feature inputs beyond MFCCs in such systems. In this paper, we propose an alternative to conventional MFCC or filterbank features, using an approach based on the Generalised Hough Transform (GHT). The GHT is a common approach used in the field of image processing for the task of object detection, where the idea is to learn the spatial distribution of a codebook of feature information relative to the location of the target class. During recognition, a simple weighted summation of the codebook activations is commonly used to detect the presence of the target classes. Here we propose to learn the weighting discriminatively in an ANN, where the aim is to optimise the static phone classification error at the output of the network. As such an ANN is common to hybrid ASR architectures, the output activations from the GHT can be considered as a novel feature for ASR. Experimental results on the TIMIT phoneme recognition task demonstrate the state-of-the-art performance of the approach.

查看原文本刊更多论文

基于判别训练的Hough变换的帧级音素识别

尽管最近在自动语音识别(ASR)中使用人工神经网络(ANN)架构取得了进展，但在此类系统中使用mfc以外的特征输入的关注相对较少。在本文中，我们提出了一种替代传统的MFCC或滤波器组特征，使用基于广义霍夫变换(GHT)的方法。GHT是图像处理领域中用于目标检测任务的常用方法，其思想是学习相对于目标类位置的特征信息的码本的空间分布。在识别过程中，通常使用码本激活的简单加权求和来检测目标类的存在。在这里，我们提出在人工神经网络中判别性地学习权重，其目的是优化网络输出的静态电话分类误差。由于这种人工神经网络在混合ASR体系结构中很常见，因此GHT的输出激活可以被认为是ASR的一个新特征。在TIMIT音素识别任务上的实验结果证明了该方法的最新性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

自引率

0.00%

发文量