Isolated word recognition of deaf speech using time delay networks

R. Kota, K. Abdelhamied, E. Goshorn
{"title":"Isoilated word recognition of deaf speech using tewe delay neiworks","authors":"R. Kota, K. Abdelhamied, E. Goshorn","doi":"10.1109/IEMBS.1993.979177","DOIUrl":null,"url":null,"abstract":"A prototype system for deaf speech recognition using time delay neural networks is proposed. The prototype system uses spectral information and other features that are known to be present in deaf speech. The network was trained using the backpropogation learning rule on a vocabulary of 20 words selected from the Modified Rhyme Test. The prototype system was tested using the speech of two profoundly deaf speakers. For speaker 1 with a speech intelligibility rating of 75.6% the network gave a peak recognition rate of 85%. For speaker 2 with a speech intelligibility rating of 35.6% the network gave a peak recognition rate of 31%. For similar recognition tasks, the IntrovoiceTM system was used to evaluate the performance of the prototype system resulting in a peak recognition rate of 60% and 18% for speaker 1 and 2 respectively. XNTRODUCI'ION There is a practical need for voice input communication aids that can reliably recognize deaf speech in real time [l]. Such aids could serve the communication needs of deaf speakers by converting unintelligible speech into printed displays or synthetic speech for use as a voice input communication system [2). Despite the large variability in deaf speech, listeners who have adjusted to the overall speech production patterns of an individual have few problems in understanding it [3]. Deviations in deaf speech include consonant substitution, vowel neutralization, utterance prolongation, voicing/unvoicing, stressing/unstressing errors etc. These errors do not occur in a random way but may reflect a different type of coding structure in producing speech [4]. Previous studies have shown that it is possible to identify consistent acoustic features to account for the variability in deaf speech [q. Using these features could improve the recognition accuracy. Neural networks have been shown to perform pattern recognition tasks such as speech recognition successfully. There is evidence that time delay neural networks can tolerate variations in the phonemic environment 16). These variations can be related to substitution and prolongation errors commonly occurring in deaf speech. METHODS Two congenitally deaf adult male speakers who had sensorineural hearing loss of 9OdB HL or more in the frequency range 200-8oOo Hz were selected. A vocabulary of twenty words was selected from list F of the Modified Rhyme Test (MRT) test [7]. Each speaker produced each word twenty six times across two recording sessions that were spaced one month apart. Five nonnai hearing listeners participated in the intelligibility testing. The order of listening task was randomized and the replay of speech samples were also randomized. Each listener was asked to select one word from a closed set of six rhyming words. The testing procedure simulated the presence of a familiar listener and had the advantage of eliminating the learning time required by other test schemes. The percentage of words correctly identified by each listener was calculated as the intelligibility rating for each speaker. Intelligibility ratings scored by the five listeners were then averaged. Speaker 1 and 2 had speech intelligibility ratings of 75.6% and 35.6% respectively. Speech recordings were bandpass fdtered between 80 and 4700 Hz and digitized at 10 kHz. 
An eighth order FFT was applied to each frame of speech consisting of 256 points. Data reduction techniques were applied to the FFT outputs which were then log compressed to yield 16 spectral energy values per frame. For each frame, the additional features extracted from speech were short-time zerocrossing rates, and log-magnitude energy. Abnormal pauses in speech were also located when the speech energy dropped below a set threshold for more than 51.2 ms. A time delay network [8] was modified to incorporate the additional features. The time discretization was increased in each input window to account for vowel and consonant prolongations in deaf speech. The network was trained using the backpropogation learning rule, in an incremental fashion with increasing numbers of training tokens in the range 5-21. Each network was tested for recognition accuracy at regular check points with 5 testing tokens. RESULTS AND DISCUSSIONS improves the recognition rate by up to 8%. The network The results indicated that using additional features 0-7803-1377-1/93 $3.00 01993 IEEE 1361 +no A d l l r h o -+M& rohra *nrarpo Spiaa Figure 1. Recognition rates for speaker 1 + . . . . . , . . . I ' 5 5 7 9 I 1 I $ 15 17 19 21 25 yntr of k&+q !dn","PeriodicalId":408657,"journal":{"name":"Proceedings of the 15th Annual International Conference of the IEEE Engineering in Medicine and Biology Societ","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1993-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 15th Annual International Conference of the IEEE Engineering in Medicine and Biology Societ","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IEMBS.1993.979177","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

A prototype system for deaf speech recognition using time delay neural networks is proposed. The prototype system uses spectral information and other features that are known to be present in deaf speech. The network was trained using the backpropagation learning rule on a vocabulary of 20 words selected from the Modified Rhyme Test. The prototype system was tested using the speech of two profoundly deaf speakers. For speaker 1, with a speech intelligibility rating of 75.6%, the network gave a peak recognition rate of 85%. For speaker 2, with a speech intelligibility rating of 35.6%, the network gave a peak recognition rate of 31%. On similar recognition tasks, the Introvoice™ system was used to benchmark the performance of the prototype system, giving peak recognition rates of 60% and 18% for speakers 1 and 2 respectively.

INTRODUCTION

There is a practical need for voice input communication aids that can reliably recognize deaf speech in real time [1]. Such aids could serve the communication needs of deaf speakers by converting unintelligible speech into printed displays or synthetic speech, acting as a voice input communication system [2]. Despite the large variability in deaf speech, listeners who have adjusted to the overall speech production patterns of an individual have few problems understanding it [3]. Deviations in deaf speech include consonant substitution, vowel neutralization, utterance prolongation, voicing/unvoicing errors, and stressing/unstressing errors. These errors do not occur randomly but may reflect a different type of coding structure in producing speech [4]. Previous studies have shown that it is possible to identify consistent acoustic features that account for the variability in deaf speech [5]. Using these features could improve recognition accuracy. Neural networks have been shown to perform pattern recognition tasks such as speech recognition successfully, and there is evidence that time delay neural networks can tolerate variations in the phonemic environment [6]. These variations can be related to the substitution and prolongation errors commonly occurring in deaf speech.

METHODS

Two congenitally deaf adult male speakers with sensorineural hearing loss of 90 dB HL or more in the frequency range 200-8000 Hz were selected. A vocabulary of twenty words was selected from list F of the Modified Rhyme Test (MRT) [7]. Each speaker produced each word twenty-six times across two recording sessions spaced one month apart. Five normal-hearing listeners participated in the intelligibility testing. The order of the listening tasks was randomized, as was the replay of the speech samples. Each listener was asked to select one word from a closed set of six rhyming words; this procedure simulated the presence of a familiar listener and had the advantage of eliminating the learning time required by other test schemes. The percentage of words correctly identified by each listener was calculated as that listener's intelligibility rating for the speaker, and the ratings of the five listeners were then averaged. Speakers 1 and 2 had speech intelligibility ratings of 75.6% and 35.6% respectively.

Speech recordings were bandpass filtered between 80 and 4700 Hz and digitized at 10 kHz. An eighth-order (256-point) FFT was applied to each frame of speech, and data reduction techniques were applied to the FFT outputs, which were then log compressed to yield 16 spectral energy values per frame. In addition to the 16 spectral energies, a short-time zero-crossing rate and a log-magnitude energy value were extracted from each frame. Abnormal pauses in speech were also located whenever the speech energy dropped below a set threshold for more than 51.2 ms (i.e., longer than two consecutive 25.6 ms frames).
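As an illustration only, the following is a minimal sketch of a frame-based front end of the kind described above: 256-point frames at 10 kHz, 16 log-compressed spectral band energies, zero-crossing rate, log energy, and a pause detector with a 51.2 ms threshold. The Hamming window, the uniform grouping of FFT bins into 16 bands, and the exact zero-crossing and pause logic are assumptions; the paper does not specify its data reduction technique or windowing.

    import numpy as np

    FS = 10_000            # sampling rate (Hz)
    FRAME_LEN = 256        # samples per frame (25.6 ms at 10 kHz)
    N_BANDS = 16           # log-compressed spectral energy values per frame
    PAUSE_MS = 51.2        # abnormal-pause duration threshold from the paper

    def frame_features(signal):
        """Return (n_frames, 18) features: 16 band energies, zero-crossing rate, log energy."""
        n_frames = len(signal) // FRAME_LEN
        window = np.hamming(FRAME_LEN)                      # assumed analysis window
        feats = np.empty((n_frames, N_BANDS + 2))
        for i in range(n_frames):
            frame = signal[i * FRAME_LEN:(i + 1) * FRAME_LEN]
            spectrum = np.abs(np.fft.rfft(frame * window)) ** 2
            # Data reduction: group non-DC bins into 16 contiguous bands (assumption).
            bands = np.array_split(spectrum[1:], N_BANDS)
            feats[i, :N_BANDS] = np.log(np.array([b.sum() for b in bands]) + 1e-10)
            # Short-time zero-crossing rate (crossings per sample) and log energy.
            feats[i, N_BANDS] = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
            feats[i, N_BANDS + 1] = np.log(np.sum(frame ** 2) + 1e-10)
        return feats

    def abnormal_pauses(feats, energy_threshold):
        """Flag frames where log energy stays below threshold for more than 51.2 ms."""
        low = feats[:, N_BANDS + 1] < energy_threshold
        min_frames = int(PAUSE_MS / (1000 * FRAME_LEN / FS)) + 1   # 3 frames > 51.2 ms
        pause = np.zeros(len(low), dtype=bool)
        run = 0
        for i, flag in enumerate(low):
            run = run + 1 if flag else 0
            if run >= min_frames:
                pause[i - run + 1:i + 1] = True
        return pause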
A time delay network [8] was modified to incorporate the additional features, and the time discretization in each input window was increased to account for the vowel and consonant prolongations in deaf speech (see the sketch at the end of this section). The network was trained using the backpropagation learning rule in an incremental fashion, with the number of training tokens increased over the range 5-21. Each network was tested for recognition accuracy at regular checkpoints with 5 testing tokens.

RESULTS AND DISCUSSIONS

The results indicated that using the additional features improves the recognition rate by up to 8%.

[Figure 1. Recognition rates for speaker 1 as a function of the number of training tokens.]
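For illustration, the sketch below shows a forward pass of a small time delay network over the 18-feature frame sequence (16 band energies plus zero-crossing rate and log energy). The layer sizes, the 5-frame input window, and the time-averaged output layer are assumptions; the paper modifies the TDNN of [8] and trains it with backpropagation, but does not give its exact dimensions, and the training loop is omitted here.

    import numpy as np

    N_FEATURES = 18   # 16 spectral energies + zero-crossing rate + log energy
    DELAY = 5         # frames seen by each hidden unit (the "time delay" window, assumed)
    N_HIDDEN = 8      # assumed hidden layer size
    N_WORDS = 20      # vocabulary from list F of the Modified Rhyme Test

    rng = np.random.default_rng(0)
    W_hidden = rng.normal(scale=0.1, size=(N_HIDDEN, DELAY * N_FEATURES))
    W_out = rng.normal(scale=0.1, size=(N_WORDS, N_HIDDEN))

    def tdnn_forward(frames):
        """frames: (n_frames, N_FEATURES) -> probabilities over the 20-word vocabulary."""
        # Slide a DELAY-frame window over time; every position feeds the same
        # hidden weights (weight sharing across time, the defining TDNN property).
        windows = np.stack([frames[t:t + DELAY].ravel()
                            for t in range(len(frames) - DELAY + 1)])
        hidden = np.tanh(windows @ W_hidden.T)           # (n_positions, N_HIDDEN)
        scores = (hidden @ W_out.T).mean(axis=0)         # integrate evidence over time
        return np.exp(scores) / np.exp(scores).sum()     # softmax over 20 words

    # Example: a 40-frame token (about 1 s of speech at 25.6 ms per frame).
    token = rng.normal(size=(40, N_FEATURES))
    print(tdnn_forward(token).argmax())                  # index of the recognized word

Widening the per-unit window (more delays per hidden unit) is one way the time discretization of each input window can be increased to tolerate the prolonged vowels and consonants noted in deaf speech.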