M Asjid Tanveer, Jesper Jensen, Zheng-Hua Tan, Jan Østergaard
{"title":"基于单麦克风深度包络分离的竞争性语音和音乐听觉注意解码。","authors":"M Asjid Tanveer, Jesper Jensen, Zheng-Hua Tan, Jan Østergaard","doi":"10.1088/1741-2552/add0e7","DOIUrl":null,"url":null,"abstract":"<p><p><i>Objective.</i>In this study, we introduce an end-to-end single microphone deep learning system for source separation and auditory attention decoding (AAD) in a competing speech and music setup. Deep source separation is applied directly on the envelope of the observed mixed audio signal. The resulting separated envelopes are compared to the envelope obtained from the electroencephalography (EEG) signals via deep stimulus reconstruction, where Pearson correlation is used as a loss function for training and evaluation.<i>Approach.</i>Deep learning models for source envelope separation and AAD are trained on target/distractor pairs from speech and music, covering four cases: speech vs. speech, speech vs. music, music vs. speech, and music vs. music. We convolve 10 different HRTFs with our audio signals to simulate the effects of head, torso and outer ear, and evaluate our model's ability to generalize. The models are trained (and evaluated) on 20 s time windows extracted from 60 s EEG trials.<i>Main results.</i>We achieve a target Pearson correlation and accuracy of 0.122% and 82.4% on the original dataset and an average target Pearson correlation and accuracy of 0.106% and 75.4% across the 10 HRTF variants. For the distractor, we achieve an average Pearson correlation of 0.004. Additionally, our model gives an accuracy of 82.8%, 85.8%, 79.7% and 81.5% across the four aforementioned cases for speech and music. With perfectly separated envelopes, we can achieve an accuracy of 83.0%, which is comparable to the case of source separated envelopes.<i>Significance.</i>We conclude that the deep learning models for source envelope separation and AAD generalize well across the set of speech and music signals and HRTFs tested in this study. We notice that source separation performs worse for a mixed music and speech signal, but the resulting AAD performance is not impacted.</p>","PeriodicalId":94096,"journal":{"name":"Journal of neural engineering","volume":"22 3","pages":""},"PeriodicalIF":3.8000,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Single-microphone deep envelope separation based auditory attention decoding for competing speech and music.\",\"authors\":\"M Asjid Tanveer, Jesper Jensen, Zheng-Hua Tan, Jan Østergaard\",\"doi\":\"10.1088/1741-2552/add0e7\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><i>Objective.</i>In this study, we introduce an end-to-end single microphone deep learning system for source separation and auditory attention decoding (AAD) in a competing speech and music setup. Deep source separation is applied directly on the envelope of the observed mixed audio signal. The resulting separated envelopes are compared to the envelope obtained from the electroencephalography (EEG) signals via deep stimulus reconstruction, where Pearson correlation is used as a loss function for training and evaluation.<i>Approach.</i>Deep learning models for source envelope separation and AAD are trained on target/distractor pairs from speech and music, covering four cases: speech vs. speech, speech vs. music, music vs. speech, and music vs. music. 
We convolve 10 different HRTFs with our audio signals to simulate the effects of head, torso and outer ear, and evaluate our model's ability to generalize. The models are trained (and evaluated) on 20 s time windows extracted from 60 s EEG trials.<i>Main results.</i>We achieve a target Pearson correlation and accuracy of 0.122% and 82.4% on the original dataset and an average target Pearson correlation and accuracy of 0.106% and 75.4% across the 10 HRTF variants. For the distractor, we achieve an average Pearson correlation of 0.004. Additionally, our model gives an accuracy of 82.8%, 85.8%, 79.7% and 81.5% across the four aforementioned cases for speech and music. With perfectly separated envelopes, we can achieve an accuracy of 83.0%, which is comparable to the case of source separated envelopes.<i>Significance.</i>We conclude that the deep learning models for source envelope separation and AAD generalize well across the set of speech and music signals and HRTFs tested in this study. We notice that source separation performs worse for a mixed music and speech signal, but the resulting AAD performance is not impacted.</p>\",\"PeriodicalId\":94096,\"journal\":{\"name\":\"Journal of neural engineering\",\"volume\":\"22 3\",\"pages\":\"\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2025-05-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of neural engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1088/1741-2552/add0e7\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of neural engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1088/1741-2552/add0e7","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Single-microphone deep envelope separation based auditory attention decoding for competing speech and music.
Objective. In this study, we introduce an end-to-end single-microphone deep learning system for source separation and auditory attention decoding (AAD) in a competing speech and music setup. Deep source separation is applied directly to the envelope of the observed mixed audio signal. The resulting separated envelopes are compared to the envelope obtained from the electroencephalography (EEG) signals via deep stimulus reconstruction, where Pearson correlation is used as the loss function for training and evaluation.

Approach. Deep learning models for source envelope separation and AAD are trained on target/distractor pairs from speech and music, covering four cases: speech vs. speech, speech vs. music, music vs. speech, and music vs. music. We convolve our audio signals with 10 different head-related transfer functions (HRTFs) to simulate the effects of the head, torso, and outer ear, and evaluate the model's ability to generalize. The models are trained and evaluated on 20 s time windows extracted from 60 s EEG trials.

Main results. On the original dataset we achieve a target Pearson correlation of 0.122 and an accuracy of 82.4%; across the 10 HRTF variants, the average target Pearson correlation is 0.106 and the average accuracy is 75.4%. For the distractor, the average Pearson correlation is 0.004. Across the four aforementioned speech and music cases, the model achieves accuracies of 82.8%, 85.8%, 79.7%, and 81.5%. With perfectly separated envelopes, we achieve an accuracy of 83.0%, which is comparable to the case of source-separated envelopes.

Significance. We conclude that the deep learning models for source envelope separation and AAD generalize well across the set of speech and music signals and HRTFs tested in this study. We observe that source separation performs worse for a mixed music and speech signal, but the resulting AAD performance is not affected.
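The abstract describes the AAD decision as comparing the EEG-reconstructed envelope against each separated source envelope via Pearson correlation and selecting the better-matching source. The sketch below illustrates only that comparison step; it is not the authors' implementation. The function names, the 64 Hz envelope rate, and the synthetic envelopes are assumptions for illustration, and the deep envelope-separation and stimulus-reconstruction networks are not shown.

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson correlation between two 1-D envelope signals."""
    x = x - x.mean()
    y = y - y.mean()
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8))

def decode_attention(reconstructed_env, separated_env_a, separated_env_b):
    """Declare the source whose separated envelope correlates more strongly
    with the EEG-reconstructed envelope as the attended source."""
    r_a = pearson_correlation(reconstructed_env, separated_env_a)
    r_b = pearson_correlation(reconstructed_env, separated_env_b)
    return ("A", r_a, r_b) if r_a >= r_b else ("B", r_a, r_b)

# Hypothetical usage on a 20 s window at an assumed 64 Hz envelope rate,
# with synthetic envelopes standing in for the separated sources and for
# the output of deep stimulus reconstruction.
fs_env, win = 64, 20
t = np.arange(fs_env * win)
env_a = np.abs(np.sin(0.05 * t)) + 0.1 * np.random.rand(t.size)
env_b = np.abs(np.sin(0.03 * t)) + 0.1 * np.random.rand(t.size)
eeg_env = env_a + 0.5 * np.random.rand(t.size)  # stand-in for the reconstructed envelope
print(decode_attention(eeg_env, env_a, env_b))
```

In the paper's setup, the same Pearson correlation also serves as the training loss for the separation and reconstruction models; the decision rule above only uses it at evaluation time to label the attended source.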