{"title":"Effects of Mandarin Tones on Acoustic Cue Weighting Patterns for Prominence","authors":"Wei Zhang, Meghan Clayards, Jinsong Zhang","doi":"10.1109/ISCSLP49672.2021.9362105","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362105","url":null,"abstract":"As a crucial perceived trait of speech, prominence is associated with various communicative functions. The acoustic cues for prominence have been explored extensively, but how the cue weighting pattern varies in different contexts needs a closer examination. This paper investigates how Mandarin tones affect the cue weighting pattern of prominence. On the basis of an annotated speech corpus, mixed-effect logistic regression (MELR) models were fitted for prominence of Mandarin syllables in each lexical tone, by taking fundamental frequency (F0), intensity, duration and formant features as independent variables, and the presence/absence of prominence as the dependent variable. Results showed the varying cue weighting patterns in different tones: (1) The first formant F1 contributes only to level tones T1/T3, and its contribution is much smaller than prosodic features; (2) For H-onset tones F0 features contribute more than intensity, while for L-onset tones F0 features contribute less than intensity; (3) For dynamic tones T2/T4, not the maximum but the minimum of F0 has contribution; (4) Duration has the largest contribution to T4, which is intrinsically shorter than other three tones","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116089766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Detection of Word-Level Reading Errors in Non-native English Speech Based on ASR Output","authors":"Ying Qin, Yao Qian, Anastassia Loukina, P. Lange, A. Misra, Keelan Evanini, Tan Lee","doi":"10.1109/ISCSLP49672.2021.9362102","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362102","url":null,"abstract":"Automated reading error detection has attracted a lot of interest in the area of computer-assisted language learning and auto-mated reading tutors. This paper presents preliminary experimental results on automatic detection of word-level reading errors in non-native speech. A state-of-the-art large vocabulary automatic speech recognition (ASR) system is developed to transcribe non-native speech, with performance comparable to humans in transcribing non-native read speech data. With this ASR system, we investigate the feasibility of detecting substitution, insertion and deletion errors from ASR decoding results on non-native read speech. Experimental results show that the performance of detecting substitution and insertion errors are on the low side. Several possible reasons for causing such results are discussed in this paper. Common types of reading errors occurring in non-native read speech and those that are difficult to be detected are analyzed for future investigation.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128194142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Method for Improving Generative Adversarial Networks in Speech Enhancement","authors":"Fan Yang, Junfeng Li, Yonghong Yan","doi":"10.1109/ISCSLP49672.2021.9362057","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362057","url":null,"abstract":"Recent advances in deep learning-based speech enhancement techniques have shown promising prospects over most traditional methods. Generative adversarial networks (GANs), as a recent breakthrough in deep learning, can effectively remove additive noise embedded in speech, improving the perceptual quality [1]. In the existing methods of using GANs to achieve speech enhancement, the discriminator often regards the clean speech signal as real data and the enhanced speech signal as fake data; however, this approach may cause feedback from the discriminator to fail to provide sufficient effective information for the generator to correct its output waveform. In this paper, we propose a new method to use GANs for speech enhancement. This method, by constructing a new learning target for the discriminator, allows the generator to obtain more valuable feed-back, generating more realistic speech signals. In addition, we introduce a new objective, which requires the generator to generate data that matches the statistics of the real data. 
Systematic evaluations and comparisons show that the proposed method yields better performance compared with state-of-art method-s, and achieves better generalization under challenging unseen noise and signal-to-noise ratio (SNR) environments.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121692215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prosody and Dialogue Act: A Perceptual Study on Chinese Interrogatives","authors":"G. Huang, Ai-jun Li, Sichen Zhang, Liang Zhang","doi":"10.1109/ISCSLP49672.2021.9362061","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362061","url":null,"abstract":"Prosody conveys dialogue acts and intentions in speech interaction. This study aims at investigating the interplay between prosody and dialogue acts (pragmatic functions) for Chinese dialogues. To this end, a perceptual experiment was carried out on interrogative intonations with varied prosodic features and contexts associated with 3 dialogue acts including request for affirmation, backchannel, and elaboration. The results demonstrated that (1) the dialogue acts of the context affects the perception of the interrogatives with the same prosodic features; (2) when there is a mismatch between the actual prosody and the context, the context limits or reduces the perception gradient of interrogatives; (3) the perception of interrogative information mainly depends on prosody, while context make contributions to the perception of interrogative-declarative information by modulating listeners’ interpretation of dialogue acts performed by certain prosodic features.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128348661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Capsule Network based End-to-end System for Detection of Replay Attacks","authors":"Meidan Ouyang, Rohan Kumar Das, Jichen Yang, Haizhou Li","doi":"10.1109/ISCSLP49672.2021.9362111","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362111","url":null,"abstract":"Automatic speaker verification systems are prone to various spoofing attacks. The convolutional neural networks are found to be effective for detection of spoofing attacks. However, they lack spatial information and relationship of low-level features with the pooling layer. On the other hand, capsule networks use vectors to record spatial information and the probability of presence simultaneously. They are known to be effective for detection of forged images and videos. In this work, we study capsule networks for replay attack detection. We consider different input features to capsule network and study on recent ASVspoof 2019 physical access corpus. The studies suggest the proposed capsule network based system performs effectively and the performance is comparable to state-of-the-art single systems for replay attack detection.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134152112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Attention-based End-to-end ASR by Incorporating an N-gram Neural Network","authors":"Junyi Ao, Tom Ko","doi":"10.1109/ISCSLP49672.2021.9362055","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362055","url":null,"abstract":"In attention-based end-to-end ASR, the intrinsic LM is modeled by an RNN and it forms the major part of the decoder. Comparing with external LMs, the intrinsic LM is considered as modest as it is only trained with the transcription associated with the speech data. Although it is a common practise to interpolate the scores of the end-to-end model and the external LM, the need of an external model hurts the novelty of end-to-end. Therefore, researchers are investigating different ways of improving the intrinsic LM of the end-to-end model. By observing the fact that N-gram LMs and RNN LMs can complement each other, we would like to investigate the effect of implementing an N-gram neural network inside the end-to-end model. In this paper, we examine two implementations of N-gram neural network in the context of attention-based end-to-end ASR. We find that both implementations improve the baseline and CBOW (Continuous Bag-of-Words) performs slightly better. We further propose a way to minimize the size of the N-gram component by utilizing the coda information of the modeling units. 
Experiments on LibriSpeech dataset show that our proposed method achieves obvious improvement with only a slight increase in model parameters.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132091610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GAN-Based Inter-Channel Amplitude Ratio Decoding in Multi-Channel Speech Coding","authors":"Jinru Zhu, C. Bao","doi":"10.1109/ISCSLP49672.2021.9362089","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362089","url":null,"abstract":"In this paper, a multi-channel speech coding method based on down-mixing and inter-channel amplitude ratio (ICAR) decoding based on generative adversarial network (GAN) is proposed. Firstly, spatial parameter inter-channel time difference (ICTD) is extracted. In the short-time Fourier transform (STFT) domain, the amplitude of the down-mixed mono signal is obtained by adding and averaging the amplitude of the multi-channel speech signals, the phase of the down-mixed mono signal is replaced by the phase of the reference channel, the STFT of the down-mixed mono signal is obtained. Then, the inverse STFT is used to obtain the down-mixed mono signal. The amplitude ratio between multichannel speech signals and down-mixed signal (ICAR) is extracted. The down-mixed mono signal is coded by Speex codec, and ICTD is quantized by a uniform scalar quantizer. The ICAR needn’t to be encoded. The ICAR is decoded from a well-trained GAN at the decoder based on the decoded mono signal. Finally, the decoded multi-channel speech signals are recovered by using the decoded down-mixed mono signal, decoded ICTD and the decoded ICAR. 
The experimental results show that the proposed multi-channel speech coding method can recover multi-channel speech signals with spatial information.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"165 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132947821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Scale Model for Mandarin Tone Recognition","authors":"Linkai Peng, Wang Dai, Dengfeng Ke, Jinsong Zhang","doi":"10.1109/ISCSLP49672.2021.9362063","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362063","url":null,"abstract":"Tone plays an important role in tonal languages such as Mandarin and tone classification is an essential component of speech evaluation of Mandarin Chinese. Previous methods for tone classification rarely take into account that different tones possess different scales along both time and frequency axis. Meanwhile, tone contours are subject to many sorts of variation and therefore information from multiple scales can help models to determine the unclear boundary of tones in continuous speech. In this work, we propose a Multi-Scale model which can gather information at multiple resolutions to better capture the characteristics of tone variations effected by complex phonetic and linguistic rules. The experimental results showed that our method achieves competitive results on the Chinese National Hi-Tech Project 863 corpus with TER of 10.5%.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114099170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impact of Mismatched Spectral Amplitude Levels on Vowel Identification in Simulated Electric-acoustic Hearing","authors":"Changjie Pan, Fei Chen","doi":"10.1109/ISCSLP49672.2021.9362088","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362088","url":null,"abstract":"The benefits of combined electric-acoustic stimulation (EAS) in terms of better speech recognition have been well documented in the literature for patients fitted with a hearing aid and a cochlear implant, providing them low-frequency and high-frequency speech information, respectively. This work assessed the effect of mismatched spectral amplitude levels on vowel identification in simulated EAS hearing. The spectral amplitude levels of four synthetic vowels (i.e., /iy/, /eh/, /oo/ and /ah/) were modified to amplify either low-frequency (≤600 Hz) or high-frequency (>600 Hz) portion, and the EAS-processed stimuli were presented to normal-hearing listeners to identify. Results showed declined vowel identification scores in response to acoustic or electric spectral amplitude amplification, and the specific loudness pattern computed from Moore et al.’s model was found to effectively account for the variance of vowel identification scores.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125467359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rapid Word Learning of Children with Cochlear Implants: Phonological Structure and Mutual Exclusivity","authors":"Yu-Chen Hung, Tzu-Hui Lin","doi":"10.1109/ISCSLP49672.2021.9362091","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362091","url":null,"abstract":"To understand how degraded speech signals with reduced spectral-temporal information impact processes of word learning in children with CIs, the present study utilized a modified version of intermodal preferential looking paradigm to investigate their use of mutual exclusivity and its interaction with phonological structure. Sixteen Mandarin-speaking children with CIs aged from 34.3 and 48.1 months (M = 40.7) were recruited to examine their ability to fast-map two novel words with reduplicated-syllable and disyllable to its corresponding reference, respectively. Familiar objects were paired with novel ones to elicit the possible use of mutual exclusivity. Overall, no significant preferential looking towards target was found for the novel-novel object pairs. The finding indicates that, regardless of the phonological structure, it is challenging for children with CIs to fast map a new word. However, a significantly higher looking proportion was obtained for the disyllabic novel object when it is paired with a familiar one, providing evidence on the use of mutual exclusivity. Furthermore, the absent preferential effect for the novel object with reduplicated-syllable over the familiar one suggests that the phonological structure seems to modulate the effect of mutual exclusivity. 
Implications for phonological structure and mutual exclusivity are discussed.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130358399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}