Title: Speech Emotion Recognition Based on Acoustic Segment Model
Authors: Siyuan Zheng, Jun Du, Hengshun Zhou, Xue Bai, Chin-Hui Lee, Shipeng Li
DOI: 10.1109/ISCSLP49672.2021.9362119
Published in: 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), 24 January 2021
Abstract: Accurate detection of emotion from speech is a challenging task due to the variability in speech and emotion. In this paper, we propose a speech emotion recognition (SER) method based on the acoustic segment model (ASM) to address this issue. Specifically, speech with different emotions is segmented more finely by the ASM. Each acoustic segment is modeled by hidden Markov models (HMMs) and decoded into a series of ASM sequences in an unsupervised way. Feature vectors are then obtained from these sequences by latent semantic analysis (LSA) and fed to a classifier. Validated on the IEMOCAP corpus, the results demonstrate that the proposed method outperforms state-of-the-art methods, with a weighted accuracy of 73.9% and an unweighted accuracy of 70.8%, respectively.
{"title":"An Attention-augmented Fully Convolutional Neural Network for Monaural Speech Enhancement","authors":"Zezheng Xu, Ting Jiang, Chao Li, JiaCheng Yu","doi":"10.1109/ISCSLP49672.2021.9362114","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362114","url":null,"abstract":"Convolutional neural networks (CNN) have made remarkable achievements in speech enhancement. However, the convolution operation is difficult to obtain the global context of the feature map due to its locality. To solve the above problem, we propose an attention-augmented fully convolutional neural network for monaural speech enhancement. More specifically, the method is to integrate a new two-dimensional relative selfattention mechanism into fully convolutional networks. Besides, we utilize Huber Loss as the loss function, which is more robust to noise. Experimental results indicate that compared with the optimally modified log-spectral amplitude (OMLSA) estimator and other CNN-based models, our proposed network has better performance in five indicators, and can well balance noise suppression and speech distortion. What is more, we also embed the proposed attention mechanism into other convolutional networks and get satisfactory results, showing that this mechanism has great generalization ability.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127125032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UNet++-Based Multi-Channel Speech Dereverberation and Distant Speech Recognition","authors":"Tuo Zhao, Yunxin Zhao, Shaojun Wang, Mei Han","doi":"10.1109/ISCSLP49672.2021.9362064","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362064","url":null,"abstract":"We propose a novel approach of using a newly appeared fully convolutional network (FCN) architecture, UNet++, for multichannel speech dereverberation and distant speech recognition (DSR). While the previous FCN architecture UNet is good at utilizing time-frequency structures of speech, UNet++ offers better robustness in network depths and skip connections. For DSR, UNet++ serves as a feature enhancement front-end, and the enhanced speech features are used for acoustic model training and recognition. We also propose a frequency-dependent convolution scheme (FDCS), resulting in new variants of UNet and UNet++. We present DSR results from the multiple distant microphone (MDM) datasets of AMI meeting corpus, and compare the performance of UNet++ with UNet and weighted prediction error (WPE). Our results demonstrate that for DSR, the UNet++-based approaches provide large word error rate (WER) reductions over its UNetand WPE-based counterparts. The UNet++ with WPE preprocessing and 4-channel input achieves the lowest WERs. The dereverberation results are also measured by speech-to-dereverberation modulation energy ratio (SRMR), from which large gains of UNet++ over UNet and WPE are also observed.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116828081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Non-parallel Sequence-to-Sequence Voice Conversion for Arbitrary Speakers","authors":"Ying Zhang, Hao Che, Xiaorui Wang","doi":"10.1109/ISCSLP49672.2021.9362095","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362095","url":null,"abstract":"Voice conversion (VC) aims to modify the speaker’s tone while preserving the linguistic information. Recent works show that voice conversion has made great progress on non-parallel data by introducing phonetic posteriorgrams (PPGs). However, once the prosody of source and target speaker differ significantly, it causes noticeable quality degradation of the converted speech. To alleviate the impact of the prosody of the source speaker, we propose a sequence-to-sequence voice conversion (Seq2SeqVC) method, which utilizes connectionist temporal classification PPGs (CTC-PPGs) as inputs and models the non-linear length mapping between CTC-PPGs and frame-level acoustic features. CTC-PPGs are extracted by the CTC based automatic speech recognition (CTC-ASR) model and used to replace time-aligned PPGs. The blank token is introduced in CTC-ASR outputs to identify fewer information frames and get around consecutive repeating characters. After removing blank tokens, the left CTC-PPGs only contain linguistic information, and the phone duration information of the source speech is removed. Thus, phone durations of the converted speech are more faithful to the target speaker, which means higher similarity to the target and less interference from different source speakers. Experimental results show our Seq2Seq-VC model achieves higher scores in similarity and naturalness tests than the baseline method. What’s more, we expand our seq2seqVC approach to voice conversion towards arbitrary speakers with limited data. The experimental results demonstrate that our Seq2Seq-VC model can transfer to a new speaker using 100 utterances (about 5 minutes).","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130594843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing the Rhythm of Instrumental Music and Vocal Music in Mandarin and English","authors":"Lujia Yang, Hongwei Ding","doi":"10.1109/ISCSLP49672.2021.9362066","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362066","url":null,"abstract":"This paper reports a study comparing the rhythm of instrumental music and vocal music in both the tonal language Mandarin Chinese and the non-tonal language British English. The widely accepted normalized pairwise variability index (nPVI) was adopted to measure the rhythm of language and music in these two cultures. Current findings validate that instrumental music in both cultures reflects the rhythmic characteristics of their corresponding languages. The rhythmic contrast of Chinese instrumental music is much lower than that of the British instrumental music. When language becomes part of the music, however, the rhythmic contrasts in Chinese vocal music is unexpectedly more variable than that in the British vocal music. Nevertheless, despite the high rhythmic contrasts in Chinese vocal music, the prominently lower rhythmic contrasts in Chinese children's songs compared to Chinese folk songs confirms the universality of the increased regularity in the rhythm of children’s songs, which demonstrates the impact of language habit on music rhythm.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132232416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: A Model Ensemble Approach for Sound Event Localization and Detection
Authors: Qing Wang, Huaxin Wu, Zijun Jing, Feng Ma, Yi Fang, Yuxuan Wang, Tairan Chen, Jia Pan, Jun Du, Chin-Hui Lee
DOI: 10.1109/ISCSLP49672.2021.9362116
Published in: 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), 24 January 2021
Abstract: In this paper, we propose a model ensemble approach for sound event localization and detection (SELD). We adopt several deep neural network (DNN) architectures to perform sound event detection (SED) and direction-of-arrival (DOA) estimation simultaneously. In general, each architecture consists of three modules stacked together: a high-level feature representation module, a temporal context representation module, and a final fully-connected module. The high-level feature representation module usually contains a series of convolutional neural network (CNN) layers to extract useful local features. The temporal context representation module models longer temporal context dependencies in the extracted features. The fully-connected module has two parallel branches, one for SED estimation and the other for DOA estimation. With different implementations of the high-level feature representation and temporal context representation modules, several network architectures are used for the SELD task. Finally, a more robust prediction of SED and DOA is obtained by model ensembling and post-processing. Tested on the development and evaluation datasets, the proposed approach achieves promising results and ranks first in the DCASE 2020 Task 3 challenge.
Index Terms: sound event localization and detection, deep neural network, model ensemble
{"title":"Complex Patterns of Tonal Realization in Taifeng Chinese","authors":"Xiaoyan Zhang, Ai-jun Li, Zhiqiang Li","doi":"10.1109/ISCSLP49672.2021.9362074","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362074","url":null,"abstract":"Taifeng Chinese is a Wu dialect that has the smallest inventory of tones while still preserving checked tones and the voicing contrast in syllable onsets. A previous acoustic study identified four surface tones in isolation as a result of tone split and the merger of tonal categories derived from the Middle Chinese tonal system. Notably, a long Yang Shang tone merged with short checked tones and the Yang Ping tone was realized as two surface tones, subject to regional and age-graded variations. Surface realization of tones in Taifeng is further examined in an acoustic investigation of disyllabic tone sandhi in verb-object (VO) combinations. Analyses of pitch contours and tonal duration from multiple speakers reveal complex patterns of tonal realization in tone sandhi. The tone sandhi in VO combinations is best characterized as the right-dominant pattern, in which the second tone has consistently longer duration and retains its citation form while the first tone is realized in a reduced pitch range with much shorter duration. Tonal realization is also governed by register alternation: the two tones are realized in opposite register specifications. In general, tonal realization in tone sandhi exhibits considerable complexities not observed in monosyllables.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116338741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Speaker Embedding Augmentation with Noise Distribution Matching
Authors: Xun Gong, Zhengyang Chen, Yexin Yang, Shuai Wang, Lan Wang, Y. Qian
DOI: 10.1109/ISCSLP49672.2021.9362090
Published in: 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), 24 January 2021
Abstract: Data augmentation (DA) is an effective strategy for building robust systems with good generalization ability. In embedding-based speaker verification, data augmentation can be applied to either the front-end embedding extractor or the back-end PLDA. Unlike the conventional back-end augmentation method, which adds noise to the raw audio and then extracts augmented embeddings, in this work we propose a noise distribution matching (NDM) algorithm in the speaker embedding space. The basic idea is to use distributions such as the Gaussian to model the difference between the clean speaker embeddings and the conventionally augmented noisy ones. Experiments are carried out on the SRE16 dataset, where the proposed NDM yields consistent performance improvements. Furthermore, we find that NDM can be robustly estimated using only a small amount of training data, which saves time and disk cost compared with the conventional augmentation method.
{"title":"A Practical Way to Improve Automatic Phonetic Segmentation Performance","authors":"Wenjie Peng, Yingming Gao, Binghuai Lin, Jinsong Zhang","doi":"10.1109/ISCSLP49672.2021.9362107","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362107","url":null,"abstract":"Automatic phonetic segmentation is a fundamental task for many applications. Segmentation systems highly rely upon the acoustic-phonetic relationship. However, the phonemes’ realization varies in continuous speech. As a consequence, segmentation systems usually suffer from such variation, which includes the intra-phone dissimilarity and the inter-phone similarity in terms of acoustic properties. In this paper, We conducted experiments following the classic GMM-HMM framework to address these issues. In the baseline setup, we found the top error comes from diphthong /oy/ and boundary of glide-to-vowel respectively, which suggested the influence of the above variation on segmentation results. Here, we present our approaches to improve automatic phonetic segmentation performance. First, we modeled the intra-phone dissimilarity using GMM with model selection at the state-level. Second, we utilized the context-dependent models to handle the inter-phone similarity due to coarticulation effect. The two approaches are coupled with the objective to improve segmentation accuracy. Experimental results demonstrated the effectiveness for the aforementioned top error. In addition, we also took the phones’ duration into account for the HMM topology design. The segmentation accuracy was further improved to 91.32% within 20ms on the TIMIT corpus after combining the above refinements, which has a relative error reduction of 3.34% compared to the raw GMM-HMM segmentation in [1].","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115600041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Experimental Research on Tonal Errors in Monosyllables of Standard Spoken Chinese Language Produced by Uyghur Learners","authors":"Qiuyuan Li, Yuan Jia","doi":"10.1109/ISCSLP49672.2021.9362101","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362101","url":null,"abstract":"This article uses phonetic experiments and quantitative statistics to investigate the lexical tone production by Uyghur learners of Standard Spoken Chinese Language (hereinafter referred to as SSCL) and compares the performance of Elementary, Advanced, and Native SSCL speakers, then uses micro-analysis of the nature, types, and acoustic performance characteristics of the errors that occur, to help us further understand these errors accurately and clearly. It is found that when Uyghur Chinese learners produce SSCL tones, their biggest problem is that the difference between Tone 2 and Tone 3 is not as obvious as that of Native SSCL speakers. This finding agrees with previous studies on Xinjiang students’ perception of SSCL lexical tones, which found that these 2 tones are often mistaken for one another. This study also finds that the tonal space of Elementary speakers is not as wide as Advanced and Native speakers’, and the minimum F0 value in Uyghur speakers locates in Tone 3 rather than Tone 4; the reasons behind these findings need to be studied later.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129533522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}