M Asjid Tanveer, Jesper Jensen, Zheng-Hua Tan, Jan Østergaard
{"title":"基于单麦克风深度包络分离的竞争性语音和音乐听觉注意解码。","authors":"M Asjid Tanveer, Jesper Jensen, Zheng-Hua Tan, Jan Østergaard","doi":"10.1088/1741-2552/add0e7","DOIUrl":null,"url":null,"abstract":"<p><p><i>Objective.</i>In this study, we introduce an end-to-end single microphone deep learning system for source separation and auditory attention decoding (AAD) in a competing speech and music setup. Deep source separation is applied directly on the envelope of the observed mixed audio signal. The resulting separated envelopes are compared to the envelope obtained from the electroencephalography (EEG) signals via deep stimulus reconstruction, where Pearson correlation is used as a loss function for training and evaluation.<i>Approach.</i>Deep learning models for source envelope separation and AAD are trained on target/distractor pairs from speech and music, covering four cases: speech vs. speech, speech vs. music, music vs. speech, and music vs. music. We convolve 10 different HRTFs with our audio signals to simulate the effects of head, torso and outer ear, and evaluate our model's ability to generalize. The models are trained (and evaluated) on 20 s time windows extracted from 60 s EEG trials.<i>Main results.</i>We achieve a target Pearson correlation and accuracy of 0.122% and 82.4% on the original dataset and an average target Pearson correlation and accuracy of 0.106% and 75.4% across the 10 HRTF variants. For the distractor, we achieve an average Pearson correlation of 0.004. Additionally, our model gives an accuracy of 82.8%, 85.8%, 79.7% and 81.5% across the four aforementioned cases for speech and music. With perfectly separated envelopes, we can achieve an accuracy of 83.0%, which is comparable to the case of source separated envelopes.<i>Significance.</i>We conclude that the deep learning models for source envelope separation and AAD generalize well across the set of speech and music signals and HRTFs tested in this study. We notice that source separation performs worse for a mixed music and speech signal, but the resulting AAD performance is not impacted.</p>","PeriodicalId":94096,"journal":{"name":"Journal of neural engineering","volume":"22 3","pages":""},"PeriodicalIF":3.8000,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Single-microphone deep envelope separation based auditory attention decoding for competing speech and music.\",\"authors\":\"M Asjid Tanveer, Jesper Jensen, Zheng-Hua Tan, Jan Østergaard\",\"doi\":\"10.1088/1741-2552/add0e7\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><i>Objective.</i>In this study, we introduce an end-to-end single microphone deep learning system for source separation and auditory attention decoding (AAD) in a competing speech and music setup. Deep source separation is applied directly on the envelope of the observed mixed audio signal. The resulting separated envelopes are compared to the envelope obtained from the electroencephalography (EEG) signals via deep stimulus reconstruction, where Pearson correlation is used as a loss function for training and evaluation.<i>Approach.</i>Deep learning models for source envelope separation and AAD are trained on target/distractor pairs from speech and music, covering four cases: speech vs. speech, speech vs. music, music vs. speech, and music vs. music. 
We convolve 10 different HRTFs with our audio signals to simulate the effects of head, torso and outer ear, and evaluate our model's ability to generalize. The models are trained (and evaluated) on 20 s time windows extracted from 60 s EEG trials.<i>Main results.</i>We achieve a target Pearson correlation and accuracy of 0.122% and 82.4% on the original dataset and an average target Pearson correlation and accuracy of 0.106% and 75.4% across the 10 HRTF variants. For the distractor, we achieve an average Pearson correlation of 0.004. Additionally, our model gives an accuracy of 82.8%, 85.8%, 79.7% and 81.5% across the four aforementioned cases for speech and music. With perfectly separated envelopes, we can achieve an accuracy of 83.0%, which is comparable to the case of source separated envelopes.<i>Significance.</i>We conclude that the deep learning models for source envelope separation and AAD generalize well across the set of speech and music signals and HRTFs tested in this study. We notice that source separation performs worse for a mixed music and speech signal, but the resulting AAD performance is not impacted.</p>\",\"PeriodicalId\":94096,\"journal\":{\"name\":\"Journal of neural engineering\",\"volume\":\"22 3\",\"pages\":\"\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2025-05-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of neural engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1088/1741-2552/add0e7\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of neural engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1088/1741-2552/add0e7","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Single-microphone deep envelope separation based auditory attention decoding for competing speech and music.
Objective. In this study, we introduce an end-to-end single-microphone deep learning system for source separation and auditory attention decoding (AAD) in a competing speech and music setup. Deep source separation is applied directly to the envelope of the observed mixed audio signal. The resulting separated envelopes are compared to the envelope obtained from the electroencephalography (EEG) signals via deep stimulus reconstruction, where Pearson correlation is used as the loss function for training and evaluation.

Approach. Deep learning models for source envelope separation and AAD are trained on target/distractor pairs from speech and music, covering four cases: speech vs. speech, speech vs. music, music vs. speech, and music vs. music. We convolve our audio signals with 10 different head-related transfer functions (HRTFs) to simulate the effects of the head, torso, and outer ear, and evaluate the model's ability to generalize. The models are trained and evaluated on 20 s time windows extracted from 60 s EEG trials.

Main results. On the original dataset we achieve a target Pearson correlation of 0.122 and an accuracy of 82.4%; across the 10 HRTF variants, the average target Pearson correlation is 0.106 and the average accuracy is 75.4%. For the distractor, the average Pearson correlation is 0.004. Across the four aforementioned speech and music cases, the model achieves accuracies of 82.8%, 85.8%, 79.7%, and 81.5%. With perfectly separated envelopes, we achieve an accuracy of 83.0%, which is comparable to the case of source-separated envelopes.

Significance. We conclude that the deep learning models for source envelope separation and AAD generalize well across the set of speech and music signals and HRTFs tested in this study. We observe that source separation performs worse for a mixed music and speech signal, but the resulting AAD performance is not affected.
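The abstract describes the AAD decision as comparing the EEG-reconstructed envelope against each separated source envelope via Pearson correlation and selecting the better-matching source. The sketch below illustrates only that comparison step; it is not the authors' implementation. The function names, the 64 Hz envelope rate, and the synthetic envelopes are assumptions for illustration, and the deep envelope-separation and stimulus-reconstruction networks are not shown.

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson correlation between two 1-D envelope signals."""
    x = x - x.mean()
    y = y - y.mean()
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8))

def decode_attention(reconstructed_env, separated_env_a, separated_env_b):
    """Declare the source whose separated envelope correlates more strongly
    with the EEG-reconstructed envelope as the attended source."""
    r_a = pearson_correlation(reconstructed_env, separated_env_a)
    r_b = pearson_correlation(reconstructed_env, separated_env_b)
    return ("A", r_a, r_b) if r_a >= r_b else ("B", r_a, r_b)

# Hypothetical usage on a 20 s window at an assumed 64 Hz envelope rate,
# with synthetic envelopes standing in for the separated sources and for
# the output of deep stimulus reconstruction.
fs_env, win = 64, 20
t = np.arange(fs_env * win)
env_a = np.abs(np.sin(0.05 * t)) + 0.1 * np.random.rand(t.size)
env_b = np.abs(np.sin(0.03 * t)) + 0.1 * np.random.rand(t.size)
eeg_env = env_a + 0.5 * np.random.rand(t.size)  # stand-in for the reconstructed envelope
print(decode_attention(eeg_env, env_a, env_b))
```

In the paper's setup, the same Pearson correlation also serves as the training loss for the separation and reconstruction models; the decision rule above only uses it at evaluation time to label the attended source.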