{"title":"A simple and effective pitch re-estimation method for rich prosody and speaking styles in HMM-based speech synthesis","authors":"Chengyuan Lin, Chien-Hung Huang, C. Kuo","doi":"10.1109/ISCSLP.2012.6423473","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423473","url":null,"abstract":"This paper proposes a novel way of controllable pitch re-estimation that can produce better pitch contour or provide diverse speaking styles for text-to-speech (TTS) systems. The method is composed of a pitch re-estimation model and a set of control parameters. The pitch re-estimation model is employed to reduce over-smoothing effects which is usually introduced by TTS training. The control parameters are designed to generate not only rich intonations but also speaking styles, e.g. a foreign accent or an excited tone. To verify the feasibility of the proposed method, we conducted experiments for both objective measures and subjective tests. Although the re-estimated pitch results in only slightly less prediction error for objective measure, it produces clearly better intonation for listening test. Moreover, the expressive speech can be generated successfully under the framework of controllable pitch re-estimation.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127405149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reconstruction of vocal tract based on multi-source image information","authors":"Song Wang, Shen Liu, Jianguo Wei, Qiang Fang, J. Dang","doi":"10.1109/ISCSLP.2012.6423533","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423533","url":null,"abstract":"At present, a variety of instruments recording the articulation have its own pros and cons. None is able to record the data containing all the information of articulators. For example, ultrasound system can obtain main surface information of the tongue, but the images are noisy and cannot record tongue tip under some cases. While the EMA system can precisely record trajectory data of the key points associated with attached sensors on the tongue surface. Therefore, we use EMA and ultrasound simultaneously as a complementary. In this paper, we will use the ultrasound system and the EMA system to record the tongue's movement. We obtain the ultrasound images and the synchronous audio by the ultrasound system; the EMA system is used to collect the EMA data and the synchronous audio. We register and match the ultrasound images and the EMA data by the audio files. And we integrate spatially the ultrasound images and the EMA data of each time point.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131393049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis on mispronunciations in CAPT based on computational speech perception","authors":"Jia Jia, Wai-Kim Leung, Ye Tian, Lianhong Cai, H. Meng","doi":"10.1109/ISCSLP.2012.6423530","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423530","url":null,"abstract":"Computer-aided Pronunciation Training (CAPT) technologies enable the use of automatic speech recognition to detect mispronunciations in second language (L2) learners' speech. In order to further facilitate learning, we aim to be able to develop a principle-based method for generating a gradation of the severity of mispronunciations. This paper presents an approach towards gradation that is motivated by auditory perception. We have developed a computational method for generating a perceptual distance (PD) between two spoken phonemes. This is used to compute the distance between two phonemes of a target (L2) language. The PD is found to correlate well with the mispronunciations detected in CAPT system for Chinese learners of English, i.e. L1 being Chinese (Cantonese) and L2 being US English. These results indicate that auditory confusion indirectly reflects pronunciation confusions in L2 learning. The PD can also be used to help us grade the severity of errors (i.e. 
mispronunciations that confuse more distant phonemes are more severe) and accordingly prioritize the order of corrective feedback generated for the learners.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127795266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive named entity recognition based on conditional random fields with automatic updated dynamic gazetteers","authors":"Xixin Wu, Zhiyong Wu, Jia Jia, Lianhong Cai","doi":"10.1109/ISCSLP.2012.6423495","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423495","url":null,"abstract":"This paper presents a hybrid model which combines conditional random fields (CRFs) with dynamic gazetteers (DGs) for the task of Chinese named entity recognition (NER). In the previous work of NER, gazetteers were widely used. But their gazetteers were all static ones which cannot adapt themselves to the new domains and new out-of-vocabulary named entities (OOVNEs). In this work, we build and maintain DGs to solve the problems and propose a method to automatically update DGs along with the recognition process of the named entities (NEs). With this method, the DGs can be updated to contain more and more new NEs and features of NEs that are not found in the training data. These newly added items make the DGs become more aware of the knowledge about new domains and hence be more adaptive to new domains for the recognition of OOVNEs. Experiments on the People's Daily corpus demonstrate that our method is effective, and can improve the average F-score by 1%~2%.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128934322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keyword-specific normalization based keyword spotting for spontaneous speech","authors":"Weifeng Li, Q. Liao","doi":"10.1109/ISCSLP.2012.6423490","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423490","url":null,"abstract":"This paper presents a novel architecture for keyword spotting in spontaneous speech, in which keyword model is trained from a small number of acoustic examples provided by a user. The word-spotting architecture relies on scoring patch feature vector sequences extracted by using sliding windows, and performing keyword-specific normalization and threshold setting. Dynamic time warping (DTW) based template matching and Gaussian Mixture Models (GMM) are proposed to model the keyword, and another GMM is proposed to model the non-keywords. Our keyword spotting experiments demonstrate the effectiveness of the proposed methods. More specifically, the proposed GMM log-likelihood ratio based method achieves about 17% absolute improvement in terms of recall rates compared to the baseline system.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114354921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A preliminary investigation of the third tone sandhi in standard Chinese with a prosodic corpus","authors":"Hongwei Ding, D. Hirst","doi":"10.1109/ISCSLP.2012.6423543","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423543","url":null,"abstract":"In standard Chinese, a low tone (Tone 3) is usually changed into a rising tone (Tone 2) when it is immediately followed by another third tone, which is known as the third tone sandhi. The 3rd tone sandhi has been widely discussed in Chinese phonology. This paper, however, employs a prosodic corpus we are developing to study the acoustic realization of the sandhi rising tones. We find that the magnitude of rising is larger within the disyllabic word boundary than across the boundary. Moreover the tone sandhi is closely related with the prominence of the 3rd tone sandhi syllable, the sandhi tone tends to rise if it is stressed, which implies it is prominence rather than reduction that is one of the main factors for the formation of the 3rd tone sandhi.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"157 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127367754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Articulatory and spectral characteristics of Cantonese vowels","authors":"Wai-Sum Lee","doi":"10.1109/ISCSLP.2012.6423472","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423472","url":null,"abstract":"The paper investigates the articulatory and spectral characteristics of the two sets of Cantonese vowels, the long [i: y: u: ε: œ: ⊐: a:] and medium-long [i y u ε œ ⊐ a], using EMA AG500 and CSL4500. Results show the acoustic consequences of up-down and front-back displacements in linguo-palatal constriction are non-linear, as what the quantal theory claims. Moderate displacements in constriction location and constriction size bring about large spectral changes. The level of sensitivity of formant frequencies to variations in articulation is lower in the high point vowels than the non-point vowels, indicating the quantal nature of the high point vowels in Cantonese. In general, for the point and non-point vowels in Cantonese, the variations in formant frequencies in relation to both variations of constriction location and constriction size are similar, which differ from the articulatory-acoustic relations in the English vowels where formant frequencies are relatively more sensitive to variations in constriction size than to constriction location.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125627350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detailed morphological analysis of mandarin sustained steady vowels","authors":"Yuguang Wang, Hongcui Wang, Jiaqi Gao, Jianguo Wei, J. Dang","doi":"10.1109/ISCSLP.2012.6423492","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423492","url":null,"abstract":"One of important issues for speech production is to investigate the relation of acoustic features and fine morphological structures of the vocal tract. This study aims to examine morphological characteristics of Mandarin sustained vowels using a female vocal tract MRI data. To do so, image preprocessing, teeth superimposition, segmentation and volume reconstruction are carried out on the MRI volumetric images to extract 3D vocal tract shapes. Then area functions are extracted from vocal tract shapes by re-slicing the vocal tract with a set of grid planes. Nine Mandarin vowels are divided into three groups based on the size rate of pharyngeal/oral cavity. Detailed analysis of these area functions are performed within the groups. The morphological characteristics of the laryngeal cavity and side branches (namely the bilateral piriform fossae, epiglottic valleculae and inter-dental spaces) are also discussed. To evaluate morphological measurements, a comparison is carried out between formants measured from real speech sounds and those calculated ones from these area functions. 
Results suggested that the calculated formants are consistent with natural speech with a mean error of 4.6%.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121933741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A synchronized pruning composition algorithm of weighted finite state transducers for large vocabulary speech recognition","authors":"Zhiyang He, Ping Lv, Wei Li, Ji Wu","doi":"10.1109/ISCSLP.2012.6423474","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423474","url":null,"abstract":"The use of weighted finite state transducer (WFST) has been a very attractive approach for large vocabulary continuous speech recognition(LVCSR). Composition is an important operation for combining different levels of WFSTs. However, the general composition algorithm may generate non-coaccessible states, which may require a large amount of memory space, especially for LVCSR applications. The general composition algorithm doesn't remove these non-coaccessible states and related transitions until composition is finished. This paper proposes an improved depth-first composition algorithm, which analyzes the property of each new generated state during the composition and removes almost all of the non-coaccessible states and related transitions timely. As a result, the requirement of memory for WFSTs' composition can be significantly decreased. Experimental results on Chinese Broadcast News(41022 words) task show that a reduction of 20% - 26% in memory space can be achieved with an increase of about 5% in the time complexity.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121467480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Perceptually-motivated assessment of automatically detected lexical stress in L2 learners' speech","authors":"Kun Li, H. Meng","doi":"10.1109/ISCSLP.2012.6423520","DOIUrl":"https://doi.org/10.1109/ISCSLP.2012.6423520","url":null,"abstract":"This paper presents a method of automatic lexical stress assessment for L2 English speech. Syllable stress can be labeled at three levels - primary (P), secondary (S) and no (N) stress, but secondary stress may vary among word pronunciations within and across accents and present difficulties for human perception. Hence, evaluation of lexical stress based on all three levels (i.e., the P-S-N criterion which requires that all syllables in a word must be correctly classified in terms of stress) may be too strict, and we may consider relaxing it to either the P-N or A-P-N criterion - the former only requires the correct placement of primary stress, while the latter relaxes further to allow for confusion between primary and secondary stress. An automatic syllable stress detector is applied to L2 learners' speech. Its output for all the syllables in a word is evaluated in terms of the P-S-N, P-N or A-P-N criterion. 
Comparisons between automatic and manual assessments of lexical stress patterns suggests that the A-P-N criterion can strike a good balance between accommodating variability and screening out problematic patterns, giving an average word accuracy of 79.6%.","PeriodicalId":186099,"journal":{"name":"2012 8th International Symposium on Chinese Spoken Language Processing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129536493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}