Modeling speech production and perception mechanisms and their applications to synthesis, recognition, and coding

ISSPA '99. Proceedings of the Fifth International Symposium on Signal Processing and its Applications (IEEE Cat. No.99EX359)
Pub Date: 1999-08-22
DOI: 10.1109/ISSPA.1999.818096

A. Alwan


Citations: 4

Abstract

Modeling speech production and perception mechanisms and their applications to synthesis, recognition, and coding

Summary form only given, as follows. Quantitative models of human speech production and perception mechanisms provide important insights into our cognitive abilities and can lead to high-quality speech synthesis, robust automatic speech recognition and coding schemes, and better speech and hearing prostheses. Some of our research activities in these two areas are described.

Our speech production work involved collecting and analyzing magnetic resonance imaging (MRI) scans, acoustic recordings, and electropalatography (EPG) data from talkers of American English during speech production. The articulatory database is the largest of its kind in the world and contains the first images of liquids (such as /l/ and /r/) and fricatives (such as /s/ and /sh/) for both male and female talkers. MR images are useful for characterizing the 3D geometry of the vocal tract (VT) and for measuring lengths, area functions, and volumes. EPG is used to study inter- and intra-speaker variability in the articulatory dynamics, while acoustic recordings are necessary for modeling. Inter- and intra-speaker characteristics of the VT and tongue shapes will be illustrated for various speech sounds, as well as results of acoustic modeling based on the MRI and acoustic data. The implications of our findings for vocal-tract normalization schemes and speech synthesis are also discussed.

In the speech perception area, aspects of auditory signal processing and speech perception are parameterized and implemented in a speech recognition system. Our models parameterize the sensitivity to spectral dynamics and to local peak frequency positions in the speech signal. These cues remain robust when listening to speech in noise. Recognition evaluations using the dynamic model with a stochastic hidden Markov model (HMM) recognition system showed increased robustness to noise over other state-of-the-art representations.

The applications of auditory modeling to speech coding are discussed. We developed an embedded, perceptually based speech and audio coder. Perceptual metrics, based on the signal-to-mask ratio calculated in short-time frames of the input signal, ensure that encoding is optimized for the human listener. An adaptive bit allocation scheme is employed, and the subband energies are then quantized. The coder is variable-rate, noise-robust, and suitable for wireless communications.
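
The abstract does not specify how bits are assigned to subbands. As a rough illustration only, the sketch below shows one common way a signal-to-mask ratio (SMR) can drive adaptive bit allocation: a greedy loop that keeps giving one bit to the subband whose quantization noise is currently most audible. This is not the author's coder. The function name `allocate_bits`, the per-band SMR values, the 6.02 dB-per-bit rule for a uniform quantizer, and the `max_bits` cap are all assumptions made for the example.

```python
import heapq

# Assumption: each extra bit of a uniform quantizer buys roughly 6.02 dB of SNR.
DB_PER_BIT = 6.02


def allocate_bits(smr_db, bit_budget, max_bits=16):
    """Greedy allocation for one short-time frame (illustrative sketch).

    smr_db     : signal-to-mask ratio per subband, in dB (assumed to be given)
    bit_budget : total number of bits available for this frame
    Returns a list with the number of bits assigned to each subband.
    """
    bits = [0] * len(smr_db)
    # Noise-to-mask ratio (NMR) = SMR - SNR. With zero bits, NMR equals SMR.
    # heapq is a min-heap, so NMR is stored negated to pop the largest first.
    heap = [(-smr, band) for band, smr in enumerate(smr_db)]
    heapq.heapify(heap)
    remaining = bit_budget
    while remaining > 0 and heap:
        neg_nmr, band = heapq.heappop(heap)
        if -neg_nmr <= 0:
            break  # the worst band is already at or below its masking threshold
        bits[band] += 1
        remaining -= 1
        if bits[band] < max_bits:
            # One more bit lowers this band's NMR by about DB_PER_BIT.
            heapq.heappush(heap, (neg_nmr + DB_PER_BIT, band))
    return bits


if __name__ == "__main__":
    frame_smr = [20.0, 5.0, -3.0, 12.0]  # hypothetical SMR values in dB
    print(allocate_bits(frame_smr, bit_budget=6))
```

Because the loop stops once every band's noise falls below its mask, the number of bits actually spent can vary from frame to frame, which is one simple way a variable-rate behavior could arise in a scheme of this kind.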
