Mixed Language Separation Using Deep Neural Network

2021 5th International Conference on Electrical, Electronics, Communication, Computer Technologies and Optimization Techniques (ICEECCOT). Pub Date: 2021-12-10. DOI: 10.1109/ICEECCOT52851.2021.9707959

Snehit Chunarkar, S. R. Chiluveru, M. Tripathy



Abstract

Because different communities speak different languages, listeners often encounter speech in which languages are mixed, for example in vlogs recorded in a foreign country or in interviews overlaid with voice dubbing. A separation mechanism can extract the speech in the desired language from such a mixture. This paper proposes a deep neural network (DNN) model for this language separation task. Features including Mel-Frequency Cepstral Coefficients (MFCC), the power spectrum, and Relative Spectral Transform Perceptual Linear Prediction (RASTA-PLP) coefficients are extracted from the mixed-language speech and used as input to the DNN. The training target is a Short-Time Fourier Transform (STFT) spectral mask. To measure the improvement, the separated speech is evaluated for intelligibility and quality using Short-Time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ) scores, respectively. The results show that audio separated by the trained DNN model has improved intelligibility and quality.
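The abstract does not say which kind of STFT spectral mask serves as the training target. As a rough illustration only, the sketch below assumes a magnitude ratio mask, one common choice, and computes it for two toy sine waves standing in for speech in two languages. The function names (`stft`, `ratio_mask`, `apply_mask`), the frame length, the hop size, and the test signals are all hypothetical and are not taken from the paper. The naive DFT uses only the standard library to stay self-contained; a real pipeline would presumably use an FFT library.

```python
import cmath
import math


def stft(signal, frame_len=64, hop=32):
    """Naive STFT: Hann-windowed frames, one-sided DFT of each frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(n_bins)
        ]
        frames.append(spectrum)
    return frames


def ratio_mask(target_spec, interferer_spec, eps=1e-8):
    """Assumed mask type: |target| / (|target| + |interferer|), in [0, 1]."""
    return [
        [abs(t) / (abs(t) + abs(i) + eps) for t, i in zip(t_frame, i_frame)]
        for t_frame, i_frame in zip(target_spec, interferer_spec)
    ]


def apply_mask(mask, mixture_spec):
    """Scale each time-frequency bin of the mixture by the mask value."""
    return [
        [m * x for m, x in zip(m_frame, x_frame)]
        for m_frame, x_frame in zip(mask, mixture_spec)
    ]


# Toy stand-ins for speech in two languages (hypothetical test signals).
sample_rate = 8000
n_samples = 512
speech_a = [math.sin(2 * math.pi * 500 * n / sample_rate)
            for n in range(n_samples)]
speech_b = [math.sin(2 * math.pi * 2000 * n / sample_rate)
            for n in range(n_samples)]
mixture = [a + b for a, b in zip(speech_a, speech_b)]

spec_a = stft(speech_a)
spec_b = stft(speech_b)
spec_mix = stft(mixture)

# During training, a mask like this would be the DNN's regression target;
# at test time the DNN's predicted mask would replace it.
mask = ratio_mask(spec_a, spec_b)
separated_spec = apply_mask(mask, spec_mix)
```

In this toy setup the mask is close to 1 in the bins dominated by the first signal and close to 0 in the bins dominated by the second, so multiplying it into the mixture spectrum keeps mostly the first signal. An inverse STFT, not shown here, would turn `separated_spec` back into a waveform for STOI and PESQ scoring.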


