Title: Parallel Local and Global Context Modeling of Deep Learning-Based Monaural Speech Source Separation Techniques
Authors: Swati Soni; Lalita Gupta; Rishav Dubey
DOI: 10.1109/ACCESS.2025.3562343
Journal: IEEE Access, vol. 13, pp. 68607-68621
Publication date: 2025-04-18
Impact factor: 3.4 (JCR Q2, Computer Science, Information Systems)
URL: https://ieeexplore.ieee.org/document/10969763/
Citations: 0
Abstract
Deep learning-based time-domain single-channel speech source separation methods have shown remarkable progress. Recent studies achieve either global or local context modeling for monaural speaker separation: existing CNN-based methods model local context, while RNN-based and attention-based methods model the global context of the speech signal. In this paper, we propose two models that combine CNN-RNN-based and CNN-attention-based separation modules in parallel, performing local and global context modeling simultaneously. At each time step, our models keep the maximum of the local and global context values, which helps them separate the speaker signals more accurately. We conducted experiments on the Libri2mix and Libri3mix datasets, where our proposed models outperform state-of-the-art methods, notably improving SDR and SI-SDR values. On the Libri2mix dataset, the parallel CNN-RNN-based and CNN-attention-based separation models achieve average SDR improvements of 2.10 dB and 2.21 dB, respectively, and average SI-SDR improvements of 2.74 dB and 2.78 dB, respectively. On the Libri3mix dataset, the proposed models achieve average SDR improvements of 0.57 dB (parallel CNN-RNN-based) and 0.87 dB (CNN-attention-based), and average SI-SDR improvements of 0.88 dB and 1.4 dB, respectively. Our work indirectly contributes to SDG Goal 10 (Reduced Inequalities) by improving communication tools for diverse linguistic communities. Furthermore, this technology aids SDG Goal 9 (Industry, Innovation, and Infrastructure) by advancing AI-powered assistive technologies, fostering innovation, and building resilient communication systems.
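The abstract's core idea is fusing a local-context branch (CNN-style) with a global-context branch (attention-style) by keeping the maximum value at each time step. The sketch below illustrates that fusion pattern with a depthwise 1-D convolution and single-head self-attention; all function names, kernel values, and tensor sizes are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

def local_branch(x, kernel):
    """Depthwise 1-D convolution over time: local context per feature channel."""
    out = np.empty_like(x)
    for f in range(x.shape[1]):
        out[:, f] = np.convolve(x[:, f], kernel, mode="same")
    return out

def global_branch(x):
    """Single-head self-attention over time: every frame attends to all frames."""
    scores = x @ x.T / np.sqrt(x.shape[1])           # (T, T) frame similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ x                               # (T, F) global context

def parallel_context(x, kernel=np.array([0.25, 0.5, 0.25])):
    """Run both branches in parallel and keep the maximum at each time step."""
    return np.maximum(local_branch(x, kernel), global_branch(x))

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 16))               # 100 frames, 16 features
fused = parallel_context(feats)
print(fused.shape)                                   # (100, 16)
```

By construction, every fused value is at least as large as both the local and the global context value at that position, matching the abstract's "keep maximum global or local context value at a particular time step" description.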
IEEE Access · Computer Science, Information Systems · Engineering, Electrical & Electronic
CiteScore: 9.80
Self-citation rate: 7.70%
Articles published: 6673
Review time: 6 weeks
Journal overview:
IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest.
IEEE Access publishes articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary": reviewers either Accept or Reject an article in the form it is submitted, in order to achieve rapid turnaround. Especially encouraged are submissions on:
Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals.
Practical articles discussing new experiments or measurement techniques, or interesting solutions to engineering problems.
Development of new or improved fabrication or manufacturing techniques.
Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.