Towards bottom-up continuous phone recognition
S. Siniscalchi, T. Svendsen, Chin-Hui Lee
2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007-12-01
DOI: 10.1109/ASRU.2007.4430174 (https://doi.org/10.1109/ASRU.2007.4430174)

We present a novel approach to designing bottom-up automatic speech recognition (ASR) systems. The key component of the proposed approach is a bank of articulatory attribute detectors implemented using a set of feed-forward artificial neural networks (ANNs). Each detector computes a score describing the activation level of its speech attribute in the current frame. An event merger first combines these cues into evidence for the presence of a higher-level feature; an evidence verifier then checks that evidence to produce hypotheses at the phone or word level. We evaluate several configurations of the proposed system on a continuous phone recognition task. Experimental results on the TIMIT database show that the system achieves a phone error rate of 25%, which is superior to results obtained with either hidden Markov model (HMM) or conditional random field (CRF) based recognizers. We believe the system's inherent flexibility and the ease of adding new detectors may provide further improvements.
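
The three stages named in the abstract (detector bank, event merger, evidence verifier) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the attribute and phone lists, the one-hidden-layer detector, the softmax merger, the threshold-based verifier, and all function names are hypothetical, and the weights are random rather than trained, so the output is not meaningful recognition.

```python
import math
import random

# Hypothetical attribute and phone inventories (not the paper's sets).
ATTRIBUTES = ["vowel", "nasal", "stop", "fricative"]
PHONES = ["aa", "m", "t", "s"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class AttributeDetector:
    """A tiny feed-forward net (one hidden layer) scoring one attribute."""

    def __init__(self, n_in, n_hidden, rng):
        # Random, untrained weights: placeholders for a trained detector.
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]

    def score(self, frame):
        """Return an activation level in (0, 1) for the given frame."""
        hidden = [sigmoid(sum(w * x for w, x in zip(row, frame)))
                  for row in self.w1]
        return sigmoid(sum(w * h for w, h in zip(self.w2, hidden)))


def merge_events(attr_scores, merger_weights):
    """Assumed merger: linear map of attribute scores, then a softmax."""
    logits = [sum(w * s for w, s in zip(row, attr_scores))
              for row in merger_weights]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verify(frame_posteriors, threshold=0.3):
    """Assumed verifier: accept the top phone of a frame only if its score
    reaches the threshold, and skip it if it equals the last emitted phone."""
    hypothesis = []
    for post in frame_posteriors:
        best = max(range(len(post)), key=post.__getitem__)
        if post[best] >= threshold:
            if not hypothesis or hypothesis[-1] != PHONES[best]:
                hypothesis.append(PHONES[best])
    return hypothesis


def recognize(frames, seed=0):
    """Run detectors, merger and verifier over a list of feature frames."""
    rng = random.Random(seed)
    n_in = len(frames[0])
    detectors = [AttributeDetector(n_in, 5, rng) for _ in ATTRIBUTES]
    merger_weights = [[rng.uniform(-2, 2) for _ in ATTRIBUTES]
                      for _ in PHONES]
    posteriors = []
    for frame in frames:
        scores = [d.score(frame) for d in detectors]
        posteriors.append(merge_events(scores, merger_weights))
    return posteriors, verify(posteriors)


if __name__ == "__main__":
    toy_frames = [[0.1, 0.5, -0.3], [0.2, 0.4, -0.1], [0.9, -0.7, 0.3]]
    posteriors, phones = recognize(toy_frames)
    print(phones)
```

In this sketch, adding a detector only means appending an entry to `ATTRIBUTES`, which widens the merger's input by one column. That is one way to read the flexibility the abstract mentions, not a description of the authors' implementation.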