Title: Speech Emotion Recognition Based on Acoustic Segment Model
Authors: Siyuan Zheng, Jun Du, Hengshun Zhou, Xue Bai, Chin-Hui Lee, Shipeng Li
DOI: 10.1109/ISCSLP49672.2021.9362119
Published in: 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), 24 January 2021
Abstract: Accurate detection of emotion from speech is a challenging task due to the variability in speech and emotion. In this paper, we propose a speech emotion recognition (SER) method based on the acoustic segment model (ASM) to address this issue. Specifically, speech with different emotions is segmented more finely by the ASM. Each acoustic segment is modeled by hidden Markov models (HMMs) and decoded into a series of ASM sequences in an unsupervised way. Feature vectors are then obtained from these sequences by latent semantic analysis (LSA) and fed to a classifier. Validated on the IEMOCAP corpus, the results demonstrate that the proposed method outperforms state-of-the-art methods, with a weighted accuracy of 73.9% and an unweighted accuracy of 70.8%, respectively.
{"title":"An Attention-augmented Fully Convolutional Neural Network for Monaural Speech Enhancement","authors":"Zezheng Xu, Ting Jiang, Chao Li, JiaCheng Yu","doi":"10.1109/ISCSLP49672.2021.9362114","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362114","url":null,"abstract":"Convolutional neural networks (CNN) have made remarkable achievements in speech enhancement. However, the convolution operation is difficult to obtain the global context of the feature map due to its locality. To solve the above problem, we propose an attention-augmented fully convolutional neural network for monaural speech enhancement. More specifically, the method is to integrate a new two-dimensional relative selfattention mechanism into fully convolutional networks. Besides, we utilize Huber Loss as the loss function, which is more robust to noise. Experimental results indicate that compared with the optimally modified log-spectral amplitude (OMLSA) estimator and other CNN-based models, our proposed network has better performance in five indicators, and can well balance noise suppression and speech distortion. What is more, we also embed the proposed attention mechanism into other convolutional networks and get satisfactory results, showing that this mechanism has great generalization ability.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127125032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UNet++-Based Multi-Channel Speech Dereverberation and Distant Speech Recognition","authors":"Tuo Zhao, Yunxin Zhao, Shaojun Wang, Mei Han","doi":"10.1109/ISCSLP49672.2021.9362064","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362064","url":null,"abstract":"We propose a novel approach of using a newly appeared fully convolutional network (FCN) architecture, UNet++, for multichannel speech dereverberation and distant speech recognition (DSR). While the previous FCN architecture UNet is good at utilizing time-frequency structures of speech, UNet++ offers better robustness in network depths and skip connections. For DSR, UNet++ serves as a feature enhancement front-end, and the enhanced speech features are used for acoustic model training and recognition. We also propose a frequency-dependent convolution scheme (FDCS), resulting in new variants of UNet and UNet++. We present DSR results from the multiple distant microphone (MDM) datasets of AMI meeting corpus, and compare the performance of UNet++ with UNet and weighted prediction error (WPE). Our results demonstrate that for DSR, the UNet++-based approaches provide large word error rate (WER) reductions over its UNetand WPE-based counterparts. The UNet++ with WPE preprocessing and 4-channel input achieves the lowest WERs. The dereverberation results are also measured by speech-to-dereverberation modulation energy ratio (SRMR), from which large gains of UNet++ over UNet and WPE are also observed.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116828081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Non-parallel Sequence-to-Sequence Voice Conversion for Arbitrary Speakers","authors":"Ying Zhang, Hao Che, Xiaorui Wang","doi":"10.1109/ISCSLP49672.2021.9362095","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362095","url":null,"abstract":"Voice conversion (VC) aims to modify the speaker’s tone while preserving the linguistic information. Recent works show that voice conversion has made great progress on non-parallel data by introducing phonetic posteriorgrams (PPGs). However, once the prosody of source and target speaker differ significantly, it causes noticeable quality degradation of the converted speech. To alleviate the impact of the prosody of the source speaker, we propose a sequence-to-sequence voice conversion (Seq2SeqVC) method, which utilizes connectionist temporal classification PPGs (CTC-PPGs) as inputs and models the non-linear length mapping between CTC-PPGs and frame-level acoustic features. CTC-PPGs are extracted by the CTC based automatic speech recognition (CTC-ASR) model and used to replace time-aligned PPGs. The blank token is introduced in CTC-ASR outputs to identify fewer information frames and get around consecutive repeating characters. After removing blank tokens, the left CTC-PPGs only contain linguistic information, and the phone duration information of the source speech is removed. Thus, phone durations of the converted speech are more faithful to the target speaker, which means higher similarity to the target and less interference from different source speakers. Experimental results show our Seq2Seq-VC model achieves higher scores in similarity and naturalness tests than the baseline method. What’s more, we expand our seq2seqVC approach to voice conversion towards arbitrary speakers with limited data. The experimental results demonstrate that our Seq2Seq-VC model can transfer to a new speaker using 100 utterances (about 5 minutes).","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130594843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing the Rhythm of Instrumental Music and Vocal Music in Mandarin and English","authors":"Lujia Yang, Hongwei Ding","doi":"10.1109/ISCSLP49672.2021.9362066","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362066","url":null,"abstract":"This paper reports a study comparing the rhythm of instrumental music and vocal music in both the tonal language Mandarin Chinese and the non-tonal language British English. The widely accepted normalized pairwise variability index (nPVI) was adopted to measure the rhythm of language and music in these two cultures. Current findings validate that instrumental music in both cultures reflects the rhythmic characteristics of their corresponding languages. The rhythmic contrast of Chinese instrumental music is much lower than that of the British instrumental music. When language becomes part of the music, however, the rhythmic contrasts in Chinese vocal music is unexpectedly more variable than that in the British vocal music. Nevertheless, despite the high rhythmic contrasts in Chinese vocal music, the prominently lower rhythmic contrasts in Chinese children's songs compared to Chinese folk songs confirms the universality of the increased regularity in the rhythm of children’s songs, which demonstrates the impact of language habit on music rhythm.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132232416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: A Model Ensemble Approach for Sound Event Localization and Detection
Authors: Qing Wang, Huaxin Wu, Zijun Jing, Feng Ma, Yi Fang, Yuxuan Wang, Tairan Chen, Jia Pan, Jun Du, Chin-Hui Lee
DOI: 10.1109/ISCSLP49672.2021.9362116
Published in: 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), 24 January 2021
Abstract: In this paper, we propose a model ensemble approach for sound event localization and detection (SELD). We adopt several deep neural network (DNN) architectures to perform sound event detection (SED) and direction-of-arrival (DOA) estimation simultaneously. In general, each architecture consists of three modules stacked together: a high-level feature representation module, a temporal context representation module, and a final fully-connected module. The high-level feature representation module usually contains a series of convolutional neural network (CNN) layers to extract useful local features. The temporal context representation module models longer temporal context dependencies in the extracted features. The fully-connected module has two parallel branches, one for SED estimation and the other for DOA estimation. With different implementations of the high-level feature representation and temporal context representation modules, several network architectures are used for the SELD task. Finally, a more robust prediction of SED and DOA is obtained by model ensembling and post-processing. Tested on the development and evaluation datasets, the proposed approach achieves promising results and ranks first in the DCASE 2020 Task 3 challenge.
Index Terms: sound event localization and detection, deep neural network, model ensemble
{"title":"Complex Patterns of Tonal Realization in Taifeng Chinese","authors":"Xiaoyan Zhang, Ai-jun Li, Zhiqiang Li","doi":"10.1109/ISCSLP49672.2021.9362074","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362074","url":null,"abstract":"Taifeng Chinese is a Wu dialect that has the smallest inventory of tones while still preserving checked tones and the voicing contrast in syllable onsets. A previous acoustic study identified four surface tones in isolation as a result of tone split and the merger of tonal categories derived from the Middle Chinese tonal system. Notably, a long Yang Shang tone merged with short checked tones and the Yang Ping tone was realized as two surface tones, subject to regional and age-graded variations. Surface realization of tones in Taifeng is further examined in an acoustic investigation of disyllabic tone sandhi in verb-object (VO) combinations. Analyses of pitch contours and tonal duration from multiple speakers reveal complex patterns of tonal realization in tone sandhi. The tone sandhi in VO combinations is best characterized as the right-dominant pattern, in which the second tone has consistently longer duration and retains its citation form while the first tone is realized in a reduced pitch range with much shorter duration. Tonal realization is also governed by register alternation: the two tones are realized in opposite register specifications. In general, tonal realization in tone sandhi exhibits considerable complexities not observed in monosyllables.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116338741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Speaker Embedding Augmentation with Noise Distribution Matching
Authors: Xun Gong, Zhengyang Chen, Yexin Yang, Shuai Wang, Lan Wang, Y. Qian
DOI: 10.1109/ISCSLP49672.2021.9362090
Published in: 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), 24 January 2021
Abstract: Data augmentation (DA) is an effective strategy for building robust systems with good generalization ability. In embedding-based speaker verification, data augmentation can be applied to either the front-end embedding extractor or the back-end PLDA. Unlike the conventional back-end augmentation method, which adds noise to the raw audio and then extracts augmented embeddings, in this work we propose a noise distribution matching (NDM) algorithm in the speaker embedding space. The basic idea is to use distributions such as the Gaussian to model the difference between the clean speaker embeddings and the conventionally augmented noisy ones. Experiments are carried out on the SRE16 dataset, where the proposed NDM yields consistent performance improvements. Furthermore, we find that NDM can be robustly estimated using only a small amount of training data, which saves time and disk cost compared with the conventional augmentation method.
{"title":"A Practical Way to Improve Automatic Phonetic Segmentation Performance","authors":"Wenjie Peng, Yingming Gao, Binghuai Lin, Jinsong Zhang","doi":"10.1109/ISCSLP49672.2021.9362107","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362107","url":null,"abstract":"Automatic phonetic segmentation is a fundamental task for many applications. Segmentation systems highly rely upon the acoustic-phonetic relationship. However, the phonemes’ realization varies in continuous speech. As a consequence, segmentation systems usually suffer from such variation, which includes the intra-phone dissimilarity and the inter-phone similarity in terms of acoustic properties. In this paper, We conducted experiments following the classic GMM-HMM framework to address these issues. In the baseline setup, we found the top error comes from diphthong /oy/ and boundary of glide-to-vowel respectively, which suggested the influence of the above variation on segmentation results. Here, we present our approaches to improve automatic phonetic segmentation performance. First, we modeled the intra-phone dissimilarity using GMM with model selection at the state-level. Second, we utilized the context-dependent models to handle the inter-phone similarity due to coarticulation effect. The two approaches are coupled with the objective to improve segmentation accuracy. Experimental results demonstrated the effectiveness for the aforementioned top error. In addition, we also took the phones’ duration into account for the HMM topology design. The segmentation accuracy was further improved to 91.32% within 20ms on the TIMIT corpus after combining the above refinements, which has a relative error reduction of 3.34% compared to the raw GMM-HMM segmentation in [1].","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115600041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Experimental Research on Tonal Errors in Monosyllables of Standard Spoken Chinese Language Produced by Uyghur Learners","authors":"Qiuyuan Li, Yuan Jia","doi":"10.1109/ISCSLP49672.2021.9362101","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362101","url":null,"abstract":"This article uses phonetic experiments and quantitative statistics to investigate the lexical tone production by Uyghur learners of Standard Spoken Chinese Language (hereinafter referred to as SSCL) and compares the performance of Elementary, Advanced, and Native SSCL speakers, then uses micro-analysis of the nature, types, and acoustic performance characteristics of the errors that occur, to help us further understand these errors accurately and clearly. It is found that when Uyghur Chinese learners produce SSCL tones, their biggest problem is that the difference between Tone 2 and Tone 3 is not as obvious as that of Native SSCL speakers. This finding agrees with previous studies on Xinjiang students’ perception of SSCL lexical tones, which found that these 2 tones are often mistaken for one another. This study also finds that the tonal space of Elementary speakers is not as wide as Advanced and Native speakers’, and the minimum F0 value in Uyghur speakers locates in Tone 3 rather than Tone 4; the reasons behind these findings need to be studied later.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129533522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}