2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP): Latest Publications

Summary On The ISCSLP 2022 Chinese-English Code-Switching ASR Challenge
2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) · Pub Date: 2022-10-12 · DOI: 10.1109/ISCSLP57327.2022.10038051
Shuhao Deng, Chengfei Li, Jinfeng Bai, Qingqing Zhang, Weiqiang Zhang, Runyan Yang, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan
Abstract: Code-switching automatic speech recognition has become one of the most challenging and most valuable scenarios for automatic speech recognition, owing to switching between languages within an utterance and the frequent occurrence of code-switching in daily life. The ISCSLP 2022 Chinese-English Code-Switching Automatic Speech Recognition (CSASR) Challenge aims to promote the development of code-switching automatic speech recognition. The challenge provided two training sets, the TAL_CSASR corpus and the MagicData-RAMC corpus, as well as a development set and a test set, which are used for CSASR model training and evaluation. Along with the challenge, we also provide the baseline system performance for reference. More than 40 teams participated, and the winning team achieved a 16.70% Mixture Error Rate (MER) on the test set, a 9.8% absolute MER improvement over the baseline system. In this paper, we describe the datasets, the associated baseline system, and the requirements, and summarize the CSASR challenge results and the major techniques and tricks used in the submitted systems.
Citations: 1
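For context, the MER used in Chinese-English code-switching evaluation is commonly computed by scoring Mandarin at the character level and English at the word level within a single edit-distance alignment. The sketch below illustrates that idea under this assumption; the challenge's official scoring script may tokenize and normalize differently.

```python
import re

def mixed_tokens(text):
    # Treat each CJK character as one token and each Latin word as one token.
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text)

def edit_distance(ref, hyp):
    # Levenshtein distance with a single-row dynamic-programming table.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def mer(ref_text, hyp_text):
    ref, hyp = mixed_tokens(ref_text), mixed_tokens(hyp_text)
    return edit_distance(ref, hyp) / max(len(ref), 1)
```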
An Ensemble Teacher-Student Learning Approach with Poisson Sub-sampling to Differential Privacy Preserving Speech Recognition
2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) · Pub Date: 2022-10-12 · DOI: 10.1109/ISCSLP57327.2022.10038060
C. Yang, Jun Qi, S. Siniscalchi, Chin-Hui Lee
Abstract: We propose an ensemble learning framework with Poisson sub-sampling to effectively train a collection of teacher models that provide a differential privacy (DP) guarantee for the training data. Through boosting under DP, a student model derived from the training data suffers little degradation relative to models trained with no privacy protection. Our proposed solution leverages two mechanisms: (i) privacy budget amplification via Poisson sub-sampling, so that training a target prediction model requires less noise to achieve the same privacy budget, and (ii) a combination of the sub-sampling technique and an ensemble teacher-student learning framework that introduces DP-preserving noise at the output of the teacher models and transfers DP-preserving properties via noisy labels. Privacy-preserving student models are then trained with the noisy labels to learn, with DP protection, the knowledge of the teacher ensemble. Experimental evidence on spoken command recognition and continuous Mandarin speech recognition shows that our proposed framework greatly outperforms existing state-of-the-art DP-preserving algorithms in both ASR tasks.
Citations: 2
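The abstract names two mechanisms: Poisson sub-sampling for privacy amplification, and noisy aggregation of teacher votes in an ensemble teacher-student setup. Below is a minimal sketch of both; the sampling rate, the Laplace-noise choice, and the function names are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_subsample(dataset, q=0.1):
    # Each example is independently included with probability q,
    # which amplifies the privacy guarantee of a DP training step.
    mask = rng.random(len(dataset)) < q
    return [ex for ex, keep in zip(dataset, mask) if keep]

def noisy_teacher_label(teacher_votes, num_classes, noise_scale=1.0):
    # Aggregate per-teacher class votes, then add noise before releasing
    # a label, so the student only ever sees DP-protected supervision.
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=noise_scale, size=num_classes)
    return int(np.argmax(counts))
```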
Synthetic Voice Detection and Audio Splicing Detection using SE-Res2Net-Conformer Architecture
2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) · Pub Date: 2022-10-07 · DOI: 10.1109/ISCSLP57327.2022.10037999
Lei Wang, Benedict Yeoh, Jun Wah Ng
Abstract: Synthetic voices and spliced audio clips have been generated to spoof Internet users and artificial intelligence (AI) technologies such as voice authentication. Existing research treats spoofing countermeasures as a binary classification problem: bonafide vs. spoof. This paper extends the existing Res2Net by incorporating the recent Conformer block to further exploit local patterns in acoustic features. Experimental results on the ASVspoof 2019 database show that the proposed SE-Res2Net-Conformer architecture improves spoofing countermeasure performance in the logical access scenario. In addition, this paper proposes to reformulate the existing audio splicing detection problem: rather than identifying complete spliced segments, it is more useful to detect the boundaries of the spliced segments. Moreover, a deep learning approach can be used to solve the problem, in contrast to previous signal processing techniques.
Citations: 3
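The SE part of the proposed architecture is the standard squeeze-and-excitation channel gate. A minimal 1-D PyTorch sketch is shown below; the reduction ratio and the exact placement inside the Res2Net blocks are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: re-weight channels using globally pooled statistics."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):             # x: (batch, channels, time)
        s = x.mean(dim=-1)            # squeeze: global average over time
        w = self.fc(s).unsqueeze(-1)  # excitation: per-channel gate in (0, 1)
        return x * w
```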
The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines
2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) · Pub Date: 2022-08-17 · DOI: 10.1109/ISCSLP57327.2022.10038258
Gaofeng Cheng, Yifan Chen, Runyan Yang, Qingxu Li, Zehui Yang, Lingxuan Ye, Pengyuan Zhang, Qingqing Zhang, Linfu Xie, Y. Qian, Kong-Aik Lee, Yonghong Yan
Abstract: The conversation scenario is one of the most important and most challenging scenarios for speech processing technologies, because people in conversation respond to each other in a casual style. Detecting the speech activities of each person in a conversation is vital to downstream tasks such as natural language processing and machine translation. The technology for detecting "who speaks when" is known as speaker diarization (SD). Diarization error rate (DER) has long been the standard evaluation metric for SD systems. However, DER fails to give enough importance to short conversational phrases, which are short but important at the semantic level. Moreover, a carefully and accurately manually-annotated test dataset suitable for evaluating conversational SD technologies is still unavailable in the speech community. In this paper, we design and describe the Conversational Short-phrase Speaker Diarization (CSSD) task, which consists of training and testing datasets, an evaluation metric, and baselines. On the dataset side, in addition to the previously open-sourced 180-hour conversational MagicData-RAMC dataset, we prepare a separate 20-hour conversational speech test set with carefully and manually verified speaker timestamp annotations for the CSSD task. On the metric side, we design the new conversational DER (CDER) evaluation metric, which calculates SD accuracy at the utterance level. On the baseline side, we adopt a commonly used method, the Variational Bayes HMM x-vector system, as the baseline of the CSSD task. Our evaluation metric is publicly available at https://github.com/SpeechClub/CDER_Metric.
Citations: 5
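To make the utterance-level idea behind CDER concrete, the toy function below counts mis-attributed utterances under an assumed, already-computed one-to-one reference-to-hypothesis speaker mapping. The official implementation at the linked repository handles speaker mapping and segment matching properly; this is only a sketch.

```python
def utterance_level_error(ref_utts, hyp_labels, speaker_map):
    """ref_utts: list of (start, end, ref_speaker) tuples; hyp_labels: the
    predicted speaker for each utterance; speaker_map: ref->hyp speaker mapping."""
    wrong = sum(
        speaker_map.get(ref_spk) != hyp
        for (_, _, ref_spk), hyp in zip(ref_utts, hyp_labels)
    )
    # Every utterance counts equally, however short it is.
    return wrong / max(len(ref_utts), 1)
```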
End-to-End Voice Conversion with Information Perturbation
2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) · Pub Date: 2022-06-15 · DOI: 10.1109/ISCSLP57327.2022.10037890
Qicong Xie, Shan Yang, Yinjiao Lei, Linfu Xie, Dan Su
Abstract: The ideal goal of voice conversion is to convert the source speaker's speech so that it sounds naturally like the target speaker while maintaining the linguistic content and prosody of the source speech. However, current approaches achieve neither comprehensive source prosody transfer nor full target speaker timbre preservation in the converted speech, and the quality of the converted speech is also unsatisfactory due to the mismatch between the acoustic model and the vocoder. In this paper, we leverage recent advances in information perturbation and propose a fully end-to-end approach to high-quality voice conversion. We first adopt information perturbation to remove speaker-related information from the source speech, disentangling speaker timbre from linguistic content so that the linguistic information can subsequently be modeled by a content encoder. To better transfer the prosody of the source speech to the target, we introduce a speaker-related pitch encoder that maintains the general pitch pattern of the source speaker while flexibly modifying the pitch intensity of the generated speech. Finally, one-shot voice conversion is achieved through continuous speaker space modeling. Experimental results indicate that the proposed end-to-end approach significantly outperforms state-of-the-art models in terms of intelligibility, naturalness, and speaker similarity.
Citations: 2
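Information perturbation generally means applying content-preserving signal transformations that scramble speaker-related cues before content encoding. A minimal sketch with common librosa operations follows; the specific transformations and ranges are assumptions (the paper's perturbation may, for example, also include formant shifting).

```python
import numpy as np
import librosa

rng = np.random.default_rng()

def perturb(y, sr):
    # Randomize pitch and speaking rate so that speaker timbre cues become
    # unreliable while the linguistic content remains intelligible.
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-4, 4))
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    return y
```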
AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation
2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) · Pub Date: 2022-06-01 · DOI: 10.1109/ISCSLP57327.2022.10037585
Kun Song, Heyang Xue, Xinsheng Wang, Jian Cong, Yongmao Zhang, Linfu Xie, Bing Yang, Xiong Zhang, Dan Su
Abstract: Speaker adaptation in text-to-speech synthesis (TTS) fine-tunes a pre-trained TTS model to adapt to new target speakers with limited data. While much effort has been devoted to this task, little work has addressed low computational resource scenarios, owing to the challenges raised by the requirements of a lightweight model and low computational complexity. In this paper, a tiny VITS-based [1] TTS model, named AdaVITS, for low computing resource speaker adaptation is proposed. To effectively reduce the parameters and computational complexity of VITS, an inverse short-time Fourier transform (iSTFT)-based waveform construction decoder is proposed to replace the upsampling-based decoder, which is resource-consuming in the original VITS. Besides, NanoFlow is introduced to share the density estimate across flow blocks to reduce the parameters of the prior encoder. Furthermore, to reduce the computational complexity of the textual encoder, scaled dot-product attention is replaced with linear attention. To deal with the instability caused by the simplified model, we use phonetic posteriorgrams (PPG) as a frame-level linguistic feature to supervise the model's mapping from phonemes to spectrum. Experiments show that AdaVITS can generate stable and natural speech in speaker adaptation with 8.97M model parameters and 0.72 GFLOPs of computational complexity.
Citations: 1
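The linear-attention substitution mentioned in the abstract can be sketched with the elu(x) + 1 feature map of Katharopoulos et al., which replaces quadratic softmax attention with an O(n) computation; AdaVITS' exact variant may differ.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # q, k: (batch, n, d); v: (batch, n, e). Cost is linear in n because the
    # key-value summary is accumulated once and shared by every query.
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)                # sum_n k_n v_n^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)       # normalized output
```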
Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition
2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) · Pub Date: 2022-05-24 · DOI: 10.1109/ISCSLP57327.2022.10037884
Yuting Yang, Binbin Du, Yuke Li
Abstract: The choice of modeling units is crucial for automatic speech recognition (ASR) tasks. In Mandarin scenarios, Chinese characters represent meaning but are not directly related to pronunciation, so considering only the written Chinese characters as modeling units is insufficient to capture speech features. In this paper, we present a novel method involving multi-level modeling units, which integrates multi-level information for Mandarin speech recognition. Specifically, the encoder block uses syllables as modeling units and the decoder block deals with character-level modeling units. To facilitate the incremental conversion from syllable features to character features, we design an auxiliary task that applies cross-entropy (CE) loss to intermediate decoder layers. During inference, input feature sequences are converted into syllable sequences by the encoder block and then into Chinese characters by the decoder block. Experiments on the widely used AISHELL-1 [1] corpus demonstrate that our method achieves promising results, with CERs of 4.1%/4.6% and 4.6%/5.2% using the Conformer and Transformer backbones respectively.
Citations: 1
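A minimal sketch of the training objective suggested by the abstract: the main character-level cross-entropy plus an auxiliary CE applied to intermediate decoder layers. The loss weight and the ignore index are illustrative assumptions.

```python
import torch
import torch.nn as nn

def multilevel_ce(final_logits, inter_logits_list, char_targets, aux_weight=0.3):
    # Logits are (batch, length, vocab); CrossEntropyLoss expects the class
    # dimension second, hence the transpose. Padded positions carry -100.
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    loss = ce(final_logits.transpose(1, 2), char_targets)
    for inter_logits in inter_logits_list:
        loss = loss + aux_weight * ce(inter_logits.transpose(1, 2), char_targets)
    return loss
```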
CorrectSpeech: A Fully Automated System for Speech Correction and Accent Reduction
2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) · Pub Date: 2022-04-12 · DOI: 10.1109/ISCSLP57327.2022.10038107
Daxin Tan, Liqun Deng, Nianzu Zheng, Y. Yeung, Xin Jiang, Xiao Chen, Tan Lee
Abstract: This study proposes a fully automated system for speech correction and accent reduction. Consider an application scenario in which a speech recording contains certain errors, e.g., inappropriate words or mispronunciations, that need to be corrected. The proposed system, named CorrectSpeech, performs the correction in three steps: recognizing the recorded speech and converting it into a time-stamped symbol sequence; aligning the recognized symbol sequence with the target text to determine the locations and types of the required edit operations; and generating the corrected speech. Experiments show that the quality and naturalness of the corrected speech depend on the performance of the speech recognition and alignment modules, as well as on the granularity of the editing operations. The proposed system is evaluated on two corpora: a manually perturbed version of VCTK, and L2-ARCTIC. The results demonstrate that our system is able to correct mispronunciations and reduce accent in speech recordings. Audio samples are available online for demonstration: https://daxintan-cuhk.github.io/CorrectSpeech/
Citations: 1
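The second step, aligning the recognized symbol sequence with the target text to derive edit operations, can be sketched with Python's standard difflib; the paper's aligner is presumably time-stamp-aware, so this is only an illustration of the idea.

```python
import difflib

def edit_operations(recognized, target):
    # Emit (operation, recognized_span, target_span) triples, where operation
    # is one of "equal", "replace", "delete", "insert".
    sm = difflib.SequenceMatcher(None, recognized, target, autojunk=False)
    return [(op, recognized[i1:i2], target[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()]

# edit_operations("i red the book".split(), "i read the book".split())
# -> [('equal', ['i'], ['i']), ('replace', ['red'], ['read']),
#     ('equal', ['the', 'book'], ['the', 'book'])]
```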
3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition
2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) · Pub Date: 2022-04-07 · DOI: 10.1109/ISCSLP57327.2022.10037818
Zhao You, Shulin Feng, Dan Su, Dong Yu
Abstract: Recently, the Conformer-based CTC/AED model has become a mainstream architecture for ASR. In this paper, based on our prior work, we identify and integrate several approaches to achieve further improvements for ASR tasks, which we denote as multi-loss, multi-path and multi-level, summarized as the "3M" model. Specifically, multi-loss refers to the joint CTC/AED loss, and multi-path denotes the Mixture-of-Experts (MoE) architecture, which can effectively increase model capacity without remarkably increasing computation cost. Multi-level means that we introduce auxiliary losses at multiple levels of a deep model to help training. We evaluate our proposed method on the public WenetSpeech dataset, and experimental results show that the proposed method provides 12.2%~17.6% relative CER improvement over the baseline model trained with the WeNet toolkit. On our large-scale 150k-hour corpus, the 3M model has also shown obvious superiority over the baseline Conformer model. Code is publicly available at https://github.com/tencentailab/3m-asr.
Citations: 5
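The "multi-path" component is a Mixture-of-Experts layer: capacity grows with the number of experts while each frame only pays for the expert it is routed to. A minimal top-1-routing sketch is below; the actual routing and load balancing in the released code may differ.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Top-1 routed mixture of expert feed-forward networks."""
    def __init__(self, dim, hidden, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):               # x: (tokens, dim)
        gate = torch.softmax(self.router(x), dim=-1)
        weight, idx = gate.max(dim=-1)  # pick one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = expert(x[mask]) * weight[mask, None]
        return out
```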
Exploiting Single-Channel Speech for Multi-Channel End-to-End Speech Recognition: A Comparative Study
2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) · Pub Date: 2022-03-31 · DOI: 10.1109/ISCSLP57327.2022.10038153
Keyu An, Zhijian Ou
Abstract: Recently, the end-to-end training approach for multi-channel ASR has shown its effectiveness; such a system usually consists of a beamforming front-end and a recognition back-end. However, end-to-end training becomes more difficult due to the integration of multiple modules, particularly considering that multi-channel speech data recorded in real environments are limited in size. This raises the demand to exploit single-channel data for multi-channel end-to-end ASR. In this paper, we systematically compare the performance of three schemes for exploiting external single-channel data for multi-channel end-to-end ASR, namely back-end pre-training, data scheduling, and data simulation, under different settings such as the size of the single-channel data and the choice of the front-end. Extensive experiments on the CHiME-4 and AISHELL-4 datasets demonstrate that while all three methods improve multi-channel end-to-end speech recognition performance, data simulation outperforms the other two, at the cost of longer training time. Data scheduling outperforms back-end pre-training marginally but nearly consistently, presumably because in the pre-training stage the back-end tends to overfit the single-channel data, especially when the single-channel data size is small.
Citations: 1
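Of the three schemes, data scheduling is the simplest to sketch: instead of a separate single-channel pre-training stage, single- and multi-channel batches are interleaved within each epoch. The mixing ratio below is an illustrative knob, not the paper's setting.

```python
import random

def schedule_batches(multi_batches, single_batches, single_ratio=0.5):
    # Mix a fraction of single-channel batches into the multi-channel epoch
    # so the back-end never trains on single-channel data alone.
    pool = [("multi", b) for b in multi_batches]
    n_single = int(len(single_batches) * single_ratio)
    pool += [("single", b) for b in random.sample(single_batches, n_single)]
    random.shuffle(pool)
    return pool
```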