{"title":"Audiovisual integration of speech by children and adults with cochlear implants","authors":"K. Kirk, D. Pisoni, Lorin Lachs","doi":"10.21437/ICSLP.2002-427","DOIUrl":"https://doi.org/10.21437/ICSLP.2002-427","url":null,"abstract":"The present study examined how prelingually deafened children and postlingually deafened adults with cochlear implants (CIs) combine visual speech information with auditory cues. Performance was assessed under auditory-alone (A), visual- alone (V), and combined audiovisual (AV) presentation formats. A measure of visual enhancement, RA, was used to assess the gain in performance provided in the AV condition relative to the maximum possible performance in the auditory-alone format. Word recogniton was highest for AV presentation followed by A and V, respectively. Children who received more visual enhancement also produced more intelligible speech. Adults with CIs made better use of visual information in more difficult listening conditions (e.g., when mutiple talkers or phonemically similar words were used). The findings are discussed in terms of the complementary nature of auditory and visual sources of information that specify the same underlying gestures and articulatory events in speech.","PeriodicalId":90685,"journal":{"name":"Proceedings : ICSLP. International Conference on Spoken Language Processing","volume":"8 1","pages":"1689-1692"},"PeriodicalIF":0.0,"publicationDate":"2002-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79175257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AUDIOVISUAL INTEGRATION OF SPEECH BY CHILDREN AND ADULTS WITH COCHEAR IMPLANTS.","authors":"Karen Iler Kirk, David B Pisoni, Lorin Lachs","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>The present study examined how prelingually deafened children and postlingually deafened adults with cochlear implants (CIs) combine visual speech information with auditory cues. Performance was assessed under auditory-alone (A), visual- alone (V), and combined audiovisual (AV) presentation formats. A measure of visual enhancement, R<sub>A</sub>, was used to assess the gain in performance provided in the AV condition relative to the maximum possible performance in the auditory-alone format. Word recogniton was highest for AV presentation followed by A and V, respectively. Children who received more visual enhancement also produced more intelligible speech. Adults with CIs made better use of visual information in more difficult listening conditions (e.g., when mutiple talkers or phonemically similar words were used). The findings are discussed in terms of the complementary nature of auditory and visual sources of information that specify the same underlying gestures and articulatory events in speech.</p>","PeriodicalId":90685,"journal":{"name":"Proceedings : ICSLP. International Conference on Spoken Language Processing","volume":"2002 ","pages":"1689-1692"},"PeriodicalIF":0.0,"publicationDate":"2002-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4214155/pdf/nihms410773.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32786798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SABLE: a standard for TTS markup","authors":"R. Sproat, A. Hunt, Mari Ostendorf, P. Taylor, A. Black, K. Lenzo, M. Edgington","doi":"10.21437/ICSLP.1998-14","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-14","url":null,"abstract":"Currently, speech synthesizers are controlled by a multitude of proprietary tag sets. These tag sets vary substantially across synthesizers and are an inhibitor to the adoption of speech synthesis technology by developers. SABLE is an XML/SGML-based markup scheme for text-to-speech synthesis, developed to address the need for a common TTS control paradigm. This paper presents an overview of the SABLE specification, and provides links to sites where further information on SABLE can be accessed.","PeriodicalId":90685,"journal":{"name":"Proceedings : ICSLP. International Conference on Spoken Language Processing","volume":"38 1","pages":"27-30"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76325810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient adaptation of TTS duration model to new speakers","authors":"Chilin Shih, Wentao Gu, J. V. Santen","doi":"10.21437/ICSLP.1998-5","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-5","url":null,"abstract":"This paper discusses a methodology using a minimal set of sentences to adapt an existing TTS duration model to capture interspeaker variations. The assumption is that the original duration database contains information of both language-specific and speaker-specific duration characteristics. In training a duration model for a new speaker, only the speaker-specific information needs to be modeled, therefore the size of the training data can be reduced drastically. Results from several experiments are compared and discussed.","PeriodicalId":90685,"journal":{"name":"Proceedings : ICSLP. International Conference on Spoken Language Processing","volume":"151 1","pages":"105-110"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75407984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A three-dimensional linear articulatory model based on MRI data","authors":"P. Badin, G. Bailly, M. Raybaudi, C. Segebarth","doi":"10.21437/ICSLP.1998-353","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-353","url":null,"abstract":"Based on a set of 3D vocal tract images obtained by MRI, a 3D statistical articulatory model has been built using guided Principal Component Analysis. It constitutes an extension to the lateral dimension of the mid-sagittal model previously developed from a radiofilm recorded on the same subject. The parameters of the 2D model have been found to be good predictors of the 3D shapes, for most configurations. A first evaluation of the model in terms of area functions and formants is presented.","PeriodicalId":90685,"journal":{"name":"Proceedings : ICSLP. International Conference on Spoken Language Processing","volume":"32 1","pages":"249-254"},"PeriodicalIF":0.0,"publicationDate":"1998-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90477926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Global optimisation of neural network models via sequential sampling-importance resampling","authors":"J. F. G. D. Freitas, S. E. Johnson, M. Niranjan, A. Gee","doi":"10.21437/ICSLP.1998-412","DOIUrl":"https://doi.org/10.21437/ICSLP.1998-412","url":null,"abstract":"We propose a novel strategy for training neural networks using sequential Monte Carlo algorithms. This global optimisation strategy allows us to learn the probability distribution of the network weights in a sequential framework. It is well suited to applications involving on-line, nonlinear or non-stationary signal processing. We show how the new algorithms can outperform extended Kalman filter (EKF) training.","PeriodicalId":90685,"journal":{"name":"Proceedings : ICSLP. International Conference on Spoken Language Processing","volume":"58 1","pages":"410-416"},"PeriodicalIF":0.0,"publicationDate":"1998-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82063926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Word boundary detection using pitch variations","authors":"V. R. Gadde, J. Srichand","doi":"10.21437/ICSLP.1996-211","DOIUrl":"https://doi.org/10.21437/ICSLP.1996-211","url":null,"abstract":"This paper proposes a method for detecting word boundaries. This method is based on the behaviour of the pitch frequency across the sentences. The pitch frequency (F 0 ) is found to rise in a word and fall to the next word. The presence of this fall is proposed as a means of detecting word boundaries. Four major Indian languages are used and the results show that nearly 85% of the word boundaries were correctly detected. The same method used for German language shows that nearly 65% of the word boundaries were correctly detected. The implicationsof these result in the development of a continuous speech recognition system are discussed.","PeriodicalId":90685,"journal":{"name":"Proceedings : ICSLP. International Conference on Spoken Language Processing","volume":"29 1 1","pages":"813-816"},"PeriodicalIF":0.0,"publicationDate":"1996-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78012504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparison of text-independent speaker recognition methods on telephone speech with acoustic mismatch","authors":"S. Vuuren","doi":"10.21437/ICSLP.1996-454","DOIUrl":"https://doi.org/10.21437/ICSLP.1996-454","url":null,"abstract":"We compare speaker recognition performance of vector quantization (VQ), Gaussian mixture modeling (GMM) and the Arithmetic Harmonic Sphericity measure (AHS) in adverse telephone speech conditions. The aim is to address the question: how do multimodal VQ and GMM typically compare to the simpler unimodal AHS for matched and mismatched training and testing environments? We study identification (closed set) and verification errors on a new multi environment database. We consider LPC and PLP features as well as their RASTA derivatives. We conclude that RASTA processing can remove redundancies from the features. We affirm that even when we use channel and noise compensation schemes, speaker recognition errors remain high when there is acoustic mismatch.","PeriodicalId":90685,"journal":{"name":"Proceedings : ICSLP. International Conference on Spoken Language Processing","volume":"25 1","pages":"1788-1791"},"PeriodicalIF":0.0,"publicationDate":"1996-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83066549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distinction between 'normal' focus and 'contrastive/emphatic' focus","authors":"A. Elsner","doi":"10.21437/ICSLP.1996-162","DOIUrl":"https://doi.org/10.21437/ICSLP.1996-162","url":null,"abstract":"","PeriodicalId":90685,"journal":{"name":"Proceedings : ICSLP. International Conference on Spoken Language Processing","volume":"6 1","pages":"642-645"},"PeriodicalIF":0.0,"publicationDate":"1996-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74944446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Relationship between discourse structure and dynamic speech rate","authors":"F. J. K. Beinum, M. E. V. Donzel","doi":"10.21437/ICSLP.1996-438","DOIUrl":"https://doi.org/10.21437/ICSLP.1996-438","url":null,"abstract":"This paper regards one specific element of a larger research project on the acoustic determinants of information structure in spontaneous and read discourse in Dutch. From a previous experiment within that project it turned out that listeners used two main cues (viz. speaking rate and intonation) to differentiate between spontaneous and read speech. The aim of the present experiment is to investigate the role of one of these prosodic cues, i.e., the local variability in speaking rate, and to study the relationship between the information structure of a spoken discourse on the one hand, and dynamic speaking rate measurements of that discourse on the other hand. Results show that there is a large variability in average syllable duration over the various interpausal speech runs for each of the eight speakers. No straightforward relation is found between the number of syllables within a run and the average syllable duration. We hypothesize that, at least in spontaneous speech, variations in speaking rate are related to the (global and/or local) information structures in the discourse. Global analysis of the discourse structure in paragraphs and clauses reveals that for each of the speakers the average syllable duration of the first run of a paragraph is longer than the overall mean value per speaker in more than 60% of the cases. Inspection of the quartiles of runs with highest ASD-values and those with lowest ASD-values for each of the speakers shows quite different structures, which can be explained on the basis of partly local and partly global discourse characteristics.","PeriodicalId":90685,"journal":{"name":"Proceedings : ICSLP. International Conference on Spoken Language Processing","volume":"3 1","pages":"1724-1727"},"PeriodicalIF":0.0,"publicationDate":"1996-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74040501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}