Isolated word recognition of deaf speech using time delay networks
R. Kota, K. Abdelhamied, E. Goshorn
Proceedings of the 15th Annual International Conference of the IEEE Engineering in Medicine and Biology Society
Published: 1993-10-28
DOI: 10.1109/IEMBS.1993.979177 (https://doi.org/10.1109/IEMBS.1993.979177)
Abstract
A prototype system for deaf speech recognition using time delay neural networks is proposed. The prototype system uses spectral information and other features that are known to be present in deaf speech. The network was trained using the backpropagation learning rule on a vocabulary of 20 words selected from the Modified Rhyme Test. The prototype system was tested using the speech of two profoundly deaf speakers. For speaker 1, with a speech intelligibility rating of 75.6%, the network gave a peak recognition rate of 85%. For speaker 2, with a speech intelligibility rating of 35.6%, the network gave a peak recognition rate of 31%. For similar recognition tasks, the Introvoice™ system was used as a benchmark for the prototype system; it gave peak recognition rates of 60% and 18% for speakers 1 and 2, respectively.

INTRODUCTION

There is a practical need for voice input communication aids that can reliably recognize deaf speech in real time [1]. Such aids could serve the communication needs of deaf speakers by converting unintelligible speech into printed displays or synthetic speech for use as a voice input communication system [2]. Despite the large variability in deaf speech, listeners who have adjusted to the overall speech production patterns of an individual have few problems in understanding it [3]. Deviations in deaf speech include consonant substitution, vowel neutralization, utterance prolongation, voicing/unvoicing errors, and stressing/unstressing errors. These errors do not occur in a random way but may reflect a different type of coding structure in producing speech [4]. Previous studies have shown that it is possible to identify consistent acoustic features to account for the variability in deaf speech [5]. Using these features could improve recognition accuracy. Neural networks have been shown to perform pattern recognition tasks such as speech recognition successfully.
There is evidence that time delay neural networks can tolerate variations in the phonemic environment [6]. These variations can be related to the substitution and prolongation errors that commonly occur in deaf speech.

METHODS

Two congenitally deaf adult male speakers with sensorineural hearing loss of 90 dB HL or more in the frequency range 200-8000 Hz were selected. A vocabulary of twenty words was selected from list F of the Modified Rhyme Test (MRT) [7]. Each speaker produced each word twenty-six times across two recording sessions spaced one month apart.

Five normal-hearing listeners participated in the intelligibility testing. The order of the listening tasks and the replay order of the speech samples were both randomized. Each listener was asked to select one word from a closed set of six rhyming words. The testing procedure simulated the presence of a familiar listener and had the advantage of eliminating the learning time required by other test schemes. The percentage of words correctly identified by each listener was calculated as the intelligibility rating for each speaker. The intelligibility ratings scored by the five listeners were then averaged. Speakers 1 and 2 had speech intelligibility ratings of 75.6% and 35.6%, respectively.

Speech recordings were bandpass filtered between 80 and 4700 Hz and digitized at 10 kHz. An eighth-order FFT was applied to each 256-point frame of speech. Data reduction techniques were applied to the FFT outputs, which were then log compressed to yield 16 spectral energy values per frame. For each frame, the additional features extracted from the speech were the short-time zero-crossing rate and the log-magnitude energy. Abnormal pauses in speech were also located wherever the speech energy dropped below a set threshold for more than 51.2 ms.

A time delay network [8] was modified to incorporate the additional features.
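The per-frame features described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The frame length, sampling rate, number of spectral values, and the 51.2 ms pause duration come from the text; everything else is an assumption: the "data reduction" step is taken to be averaging over equal-width bands, the log is base 10, the log-magnitude energy is taken as the log of the sum of squared samples, frames are assumed not to overlap, and all function names are hypothetical. At 10 kHz a 256-point frame lasts 25.6 ms, so 51.2 ms corresponds to two such frames under the non-overlap assumption.

```python
import cmath
import math

SAMPLE_RATE_HZ = 10_000  # from the text
FRAME_LEN = 256          # from the text (2**8 points)
NUM_BANDS = 16           # from the text
PAUSE_MS = 51.2          # from the text
EPS = 1e-12              # assumption: guards against log(0)


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def frame_features(frame):
    """Return (16 log band energies, zero-crossing count, log energy)."""
    spectrum = fft(frame)
    power = [abs(c) ** 2 for c in spectrum[: len(spectrum) // 2]]
    width = len(power) // NUM_BANDS  # assumption: equal-width bands
    bands = [
        math.log10(sum(power[b * width:(b + 1) * width]) / width + EPS)
        for b in range(NUM_BANDS)
    ]
    # Count sign changes between consecutive samples.
    zero_crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    # Assumption: log-magnitude energy = log10 of the sum of squares.
    log_energy = math.log10(sum(s * s for s in frame) + EPS)
    return bands, zero_crossings, log_energy


def find_pauses(log_energies, threshold):
    """Return (start, end) frame-index pairs of low-energy runs that
    last longer than PAUSE_MS, assuming non-overlapping frames."""
    frame_ms = 1000.0 * FRAME_LEN / SAMPLE_RATE_HZ
    pauses, start = [], None
    # A sentinel above any threshold closes a run that reaches the end.
    for i, e in enumerate(list(log_energies) + [float("inf")]):
        if e < threshold:
            if start is None:
                start = i
        elif start is not None:
            if (i - start) * frame_ms > PAUSE_MS:
                pauses.append((start, i))
            start = None
    return pauses
```

Under these assumptions, each frame yields 16 spectral values plus two additional scalar features, and `find_pauses` flags only runs of three or more consecutive low-energy frames, since two frames last exactly 51.2 ms. The threshold value itself is not given in the source and would have to be chosen per recording.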
The time discretization was increased in each input window to account for vowel and consonant prolongations in deaf speech. The network was trained using the backpropagation learning rule in an incremental fashion, with the number of training tokens increasing from 5 to 21. Each network was tested for recognition accuracy at regular checkpoints with 5 testing tokens.

RESULTS AND DISCUSSIONS

The results indicated that using additional features improves the recognition rate by up to 8%. The network

0-7803-1377-1/93 $3.00 © 1993 IEEE

Figure 1. Recognition rates for speaker 1
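As a rough illustration of what a time delay layer computes, the sketch below applies one set of weights to every window of consecutive frames, which is the property that lets such a network tolerate shifts in time. It is a generic sketch, not the modified network from the paper: the layer sizes, the sigmoid activation, the choice of 18 inputs per frame (16 spectral values plus zero-crossing rate and log energy), and the name `tdnn_layer` are all assumptions made for illustration.

```python
import math
import random


def tdnn_layer(frames, weights, biases):
    """Apply one time delay layer.

    frames:  list of T feature vectors, each of length F
    weights: weights[u][d][f] for unit u, delay d, feature f
    biases:  one bias per unit
    Returns T - D + 1 output vectors (D = number of delays),
    one per window position; the same weights are reused at every position.
    """
    num_delays = len(weights[0])
    outputs = []
    for t in range(len(frames) - num_delays + 1):
        window = frames[t:t + num_delays]
        row = []
        for unit_w, b in zip(weights, biases):
            s = b + sum(
                w * x
                for w_d, frame in zip(unit_w, window)
                for w, x in zip(w_d, frame)
            )
            row.append(1.0 / (1.0 + math.exp(-s)))  # sigmoid activation
        outputs.append(row)
    return outputs


# Hypothetical sizes, chosen only for illustration.
random.seed(0)
NUM_FRAMES, NUM_FEATURES, NUM_DELAYS, NUM_UNITS = 15, 18, 3, 8
frames = [[random.uniform(-1, 1) for _ in range(NUM_FEATURES)]
          for _ in range(NUM_FRAMES)]
weights = [[[random.uniform(-0.1, 0.1) for _ in range(NUM_FEATURES)]
            for _ in range(NUM_DELAYS)]
           for _ in range(NUM_UNITS)]
biases = [0.0] * NUM_UNITS
hidden = tdnn_layer(frames, weights, biases)
```

Because the weights are shared across window positions, dropping the first input frame simply drops the first output vector. Widening the window (more delays) is one way to read the "increased time discretization" mentioned above, although the source does not give the actual window sizes.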