{"title":"Dialogue scenario classification based on social factors","authors":"Yuning Liu, Di Zhou, M. Unoki, J. Dang, Ai-jun Li","doi":"10.1109/ISCSLP57327.2022.10037880","DOIUrl":"https://doi.org/10.1109/ISCSLP57327.2022.10037880","url":null,"abstract":"The tendency of interlocutors to become more similar to each other in the way they speak, this behavior is known in the literature as entrainment, accommodation, or adaptation. Previous studies indicated that entrainment can be treated as a social factor in human-human conversations. However, previous research suggests that this phenomenon has many subtleties. One of these cues is that entrainment on an acoustic feature might be associated with disentrainment on another in conversation, which means we have to consider these features together. Therefore, we proposed a linear dimensionality-reduction method that combines acoustic features to calculate three entrainment metrics: proximity, convergence, and synchrony. The three entrainment metrics are referred to as social factors hereafter. Our results show these social factors play an important role in a classification task. We also found that these social factors perform a better classification accuracy than combining each individual acoustic feature’s entrainment. The proposed social factors can help the human-machine interface to have the ability to adapt to the different scenarios in dialogue.","PeriodicalId":246698,"journal":{"name":"2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127102890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Rare Words Recognition through Homophone Extension and Unified Writing for Low-resource Cantonese Speech Recognition","authors":"Ho-Lam Chung, Junan Li, Pengfei Liu1, Wai-Kim Leung, Xixin Wu, H. Meng","doi":"10.1109/ISCSLP57327.2022.10038220","DOIUrl":"https://doi.org/10.1109/ISCSLP57327.2022.10038220","url":null,"abstract":"Homophone characters are common in tonal syllable-based languages, such as Mandarin and Cantonese. The data-intensive end-to-end Automatic Speech Recognition (ASR) systems are more likely to mis-recognize homophone characters and rare words under low-resource settings. For the problem of low-resource Cantonese speech recognition, this paper presents a novel homophone extension method to integrate human knowledge of the homophone lexicon into the beam search decoding process with language model re-scoring. Besides, we propose an automatic unified writing method to merge the variants of Cantonese characters and standardize speech annotation guidelines, which enables more efficient utilization of labeled utterances by providing more samples for the merged characters. We empirically show that both homophone extension and unified writing improve the recognition performance significantly on both in-domain and out-of-domain test sets, with an absolute Character Error Rate (CER) decrease of around 5% and 18%.","PeriodicalId":246698,"journal":{"name":"2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127375714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Aphasia Detection for Cantonese-Speaking and Mandarin-Speaking Patients Using Pre-Trained Language Models","authors":"Ying Qin, Tan Lee, A. Kong, Feng Lin","doi":"10.1109/ISCSLP57327.2022.10037929","DOIUrl":"https://doi.org/10.1109/ISCSLP57327.2022.10037929","url":null,"abstract":"Automatic analysis of aphasic speech based on speech technology has been extensively investigated in recent years, but there has been a few studies on Chinese languages. In this paper, we focus on automatic aphasia detection for Cantonese-and Mandarin-speaking patients using state-of-the-art pre-trained language models that support both traditional and simplified Chinese. Given speech transcriptions of subjects, pre-trained language models are used in two ways: 1) pre-trained language model derived embeddings followed by a classifier; 2) pre-trained language model fine-tuned for aphasia detection task. Both approaches are demonstrated to outperform baseline models using acoustic features and static word embeddings. The best accuracy is obtained with fine-tuned BERT models, achieving 0.98 and 0.94 for Cantonese-speaking and Mandarin-speaking subjects respectively. We also investigate the feasibility of applying the cross-lingual pre-trained language model fine-tuned by aphasia detection task for Cantonese-speaking subjects to Mandarin-speaking subjects with limited data. The promising results will hopefully make it possible to perform detection on those low-resource pathological speech which is difficult to implement a specific detection system.","PeriodicalId":246698,"journal":{"name":"2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123804108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lightweight End-To-End Deep Learning Model For Music Source Separation","authors":"Yao-Ting Wang, Yi-Xing Lin, Kai-Wen Liang, Tzu-Chiang Tai, Jia-Ching Wang","doi":"10.1109/ISCSLP57327.2022.10037912","DOIUrl":"https://doi.org/10.1109/ISCSLP57327.2022.10037912","url":null,"abstract":"In this work, we propose a lightweight end-to-end music source separation deep learning model. Deep learning models for audio source separation based on time-domain have been proposed for end-to-end processing. However, the proposed models are complex and difficult to use when the computing resources of the device are limited. Additionally, long delays may be expected since long-term inputs are required to obtain adequate results for separation, making the models unsuitable for applications that require low latency. In the proposed model, Atrous Spatial Pyramid Pooling is used to reduce the number of parameters, and the receptive field preserving decoder is utilized to enhance the result of separation while the input context length is limited. The experimental results show that the proposed method obtains better results than previous methods while using 10% or fewer parameters.","PeriodicalId":246698,"journal":{"name":"2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125382475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Spoken Language Teaching Tech: Combining Multi-attention and AdaIN for One-shot Cross Language Voice Conversion","authors":"Dengfeng Ke, Wenhan Yao, Ruixin Hu, Liangjie Huang, Qi Luo, Wentao Shu","doi":"10.1109/ISCSLP57327.2022.10038137","DOIUrl":"https://doi.org/10.1109/ISCSLP57327.2022.10038137","url":null,"abstract":"Computer aided pronunciation training(CAPT) plays an important role in oral language teaching. The main methods of traditional computer-assisted oral teaching include mispronunciation detection and pronunciation scoring and assessment.However, these two techniques only give negative feedback information such as scores or error categories. In this case,it is difficult for learners to refine their pronunciation through these two indicators without the guidance of correct speech.To tackle this problem, we proposed a cross language voice conversion(VC) framework that can generate speech with template speech content and learners’ own timbre,which can guide the learner’s pronunciation.To improve VC effect,we apply AdaIN in the fore-end and after the Value matrix in multi-head attention once respectively,called attention-AdaIN,which can improve the style transfer and sequence generation ability.We used attention-AdaIN to construct VC framework based on VAE.Experiments conducted on the AISHELL-3 and VCTK corpus showed that this new aprroach improved the baseline VAE-VC.","PeriodicalId":246698,"journal":{"name":"2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116821830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prosodic Encoding of Mandarin Chinese Intonation by Uygur Speakers in Declarative and Interrogative Sentences","authors":"Tong Li, Hui Feng, Yuan Jia","doi":"10.1109/ISCSLP57327.2022.10037847","DOIUrl":"https://doi.org/10.1109/ISCSLP57327.2022.10037847","url":null,"abstract":"As a major cause of non-native accent for L2 learners, L2 intonation plays an important role in the acquisition of L2 suprasegment. Few studies have been on the prosodic encoding of Chinese intonation by Uygur learners with Mandarin Chinese as a second language (CSL). With L2 Intonation Learning theory (LILt) as the theoretical framework, this study investigates the prosodic encoding of Mandarin intonation by Uygur CSL learners and compares with Beijing Mandarin speakers. Twelve speakers were invited to produce six pairs of Mandarin declarative and interrogative intonations in different tone sequences. It is found that for Uygur CSL learners, the pitch is falling in Mandarin declarative intonation and rising in interrogative intonation, which is similar to Mandarin speakers. However, the bottom lines in two intonations both drop slower. The interactions of L1 and L2 result in the narrowing trend of tonal pitch ranges (TPRs) in declarative intonation assimilated to L1 and the expanding trend in interrogative intonation assimilated to L2.","PeriodicalId":246698,"journal":{"name":"2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129082766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Perception and production of Mandarin vowels by teenagers–blind and sighted","authors":"Moyu Chen, Jing Qi, Xiyu Wu","doi":"10.1109/ISCSLP57327.2022.10037586","DOIUrl":"https://doi.org/10.1109/ISCSLP57327.2022.10037586","url":null,"abstract":"This study aims to explore the perception and production of Mandarin vowels by the blind and sighted. Twenty blind and twenty sighted teenagers participated in the identification and pronunciation experiment of five Mandarin vowels. We calculated the variability and contrast parameters in the first and the second formant space to investigate the articulatory and perceptual abilities of subjects. The Coefficient of Variation (CV) used to quantify vowel variability reflected the precision of perception and production. The result demonstrated that blind people had more variable perceptions, but there was no difference in production. Vowel contrast was represented by the Average Vowel Space (AVS) of /a/, /i/and/u/, reflecting the distinguishability of vowels. The AVSs of the blind were significantly smaller in perception than the sighted, but larger in production. We found that the difference in perception mainly comes from the second formant. Also, the specific vowel absence observed in blind participants in the perceptual experiment suggested that the lack of vision is likely to bring about differences in vowel perception cues. Unfortunately, we did not find any correlations between perception and production in terms of precision and distinction. Improved experiments are required to explore the relationship between perception and production.","PeriodicalId":246698,"journal":{"name":"2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130611355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequence Distribution Matching for Unsupervised Domain Adaptation in ASR","authors":"Qingxu Li, Hanjing Zhu, Liuping Luo, Gaofeng Cheng, Pengyuan Zhang, Jiasong Sun, Yonghong Yan","doi":"10.1109/ISCSLP57327.2022.10037857","DOIUrl":"https://doi.org/10.1109/ISCSLP57327.2022.10037857","url":null,"abstract":"Unsupervised domain adaptation (UDA) aims to improve the cross-domain model performance without labeled target domain data. Distribution matching is a widely used UDA approach for automatic speech recognition (ASR), which learns domain-invariant while class-discriminative representations. Most previous approaches to distribution matching simply treat all frames in a sequence as independent features and match them between domains. Although intuitive and effective, the neglect of the sequential property could be sub-optimal for ASR. In this work, we propose to explicitly capture and match the sequence-level statistics with sequence pooling, leading to a sequence distribution matching approach. We examined the effectiveness of the sequence pooling on the basis of the maximum mean discrepancy (MMD) based and domain adversarial training (DAT) based distribution matching approaches. Experimental results demonstrated that the sequence pooling methods effectively boost the performance of distribution matching, especially for the MMD-based approach. By combining sequence pooling features and original features, MMD-based and DAT-based approaches relatively reduce WER by 12.08% and 14.72% over the source domain model.","PeriodicalId":246698,"journal":{"name":"2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132421477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Study on Mandarin Chinese “Bu” Tone Sandhi Followed by English Words","authors":"Kaige Gao, Xiyu Wu","doi":"10.1109/ISCSLP57327.2022.10037923","DOIUrl":"https://doi.org/10.1109/ISCSLP57327.2022.10037923","url":null,"abstract":"This paper focuses on tone sandhi of the Chinese word “Bu” followed by English words. In Mandarin Chinese, the tone of the word “Bu” is tone 4 (the high-falling tone), and its tone sandhi rule is that it becomes tone 2 (the mid-rising tone) when followed by another tone 4 syllable. However, few researchers have paid attention to the tone sandhi rules of “Bu” when followed by English words. In this study, the tone sandhi rule of “Bu” when followed by English words is explored by a perceptual experiment and a pitch contour analysis. Results show that the Mandarin Chinese “Bu” tone sandhi also applies when “Bu” is followed by an English word. When followed by a high-falling monosyllabic English word, “Bu” will become tone 2. Furthermore, the pitch contour of English words embedded in Mandarin sentences is predictable according to their syllabic structure and lexical stress. This paper further proves that native Chinese speakers perceive English words inserted into Chinese speech as tonal-like language. The tonal patterns of English words summarized in this study can provide theoretical support for improving the naturalness of mixed-language speech synthesis.","PeriodicalId":246698,"journal":{"name":"2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132546435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reconstruction of speech spectrogram based on non-invasive EEG signal","authors":"Di Zhou, M. Unoki, Gaoyan Zhang, J. Dang","doi":"10.1109/ISCSLP57327.2022.10038234","DOIUrl":"https://doi.org/10.1109/ISCSLP57327.2022.10038234","url":null,"abstract":"Decoding neural activity into speech could enable natural conversations for people who are unable to communicate as a result of neurological diseases. Studies have proven that speech could be directly recognized or synthesized from intracranial recordings. However, intracranial electrocorticography is invasive, thus not comfortable for patients. By the acoustic representation of speech in the high-level brain cortex, we successfully reconstructed a speech spectrogram from non-invasive electroencephalography (EEG), which has similar accuracy to previous intracranial recording. As well as the reported superior temporal gyrus, premotor cortex, and inferior frontal gyrus, we also found speech representations in several other cortices such as an entorhinal, fusiform, and temporal pole. The intelligibility of the recovered speech in this study was not high enough, however, our findings show a possibility to reconstruct speech from non-invasive EEG in the future.","PeriodicalId":246698,"journal":{"name":"2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133757502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}