Multi-Class Decoding of Attended Speaker Direction Using Electroencephalogram and Audio Spatial Spectrum

IF 5.2 | Zone 2, Medicine | JCR Q2, Engineering, Biomedical

IEEE Transactions on Neural Systems and Rehabilitation Engineering Pub Date : 2025-07-23 DOI:10.1109/TNSRE.2025.3591819

Yuanming Zhang;Jing Lu;Fei Chen;Haoliang Du;Xia Gao;Zhibin Lin

Vol. 33, pp. 2892-2903. Journal article; not open access. Full text: https://ieeexplore.ieee.org/document/11091336/


Abstract

Prior research on directional focus decoding, a.k.a. selective Auditory Attention Decoding (sAAD), has primarily focused on binary “left-right” tasks. However, decoding of the attended speaker’s precise direction is desired. Existing approaches often underutilize spatial audio information, resulting in suboptimal performance. In this paper, we address this limitation by leveraging a recent dataset containing two concurrent speakers at two of 14 possible directions. We demonstrate that models relying solely on EEG yield limited decoding accuracy in leave-one-out settings. To enhance performance, we propose to integrate spatial spectra as an additional input. We evaluate three model architectures, namely CNN, LSM-CNN, and Deformer, under two strategies for utilizing spatial information: all-in-one (end-to-end) and pairwise (two-stage) decoding. While all-in-one decoders directly take dual-modal inputs and output the attended direction, pairwise decoders first leverage spatial spectra to decode the competing pairs, and then a specific model is used to decode the attended direction. Our proposed all-in-one Sp-EEG-Deformer model achieves 14-class decoding accuracies of 55.35% and 57.19% in leave-one-subject-out and leave-one-trial-out scenarios, respectively, using 1-second decision windows (chance level: 50%, indicating random guessing). Meanwhile, the pairwise Sp-EEG-Deformer decoder achieves a 14-class decoding accuracy of 63.62% (10 s). Our experiments reveal that spatial spectra are particularly effective at reducing the 14-class problem into a binary one. On the other hand, EEG features are more discriminative and play a crucial role in precisely identifying the final attended direction within this reduced 2-class set. These results highlight the effectiveness of our proposed dual-modal directional decoding strategies.
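The pairwise (two-stage) strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only says that spatial spectra are used to decode the competing pair and that a specific model then picks the attended direction. Here stage 1 is replaced by simple peak picking over an assumed 14-bin spatial spectrum, and stage 2 is a hypothetical callable `binary_decoder` that returns the probability that the first direction of the pair is the attended one. All function and parameter names are assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

N_DIRECTIONS = 14  # number of candidate directions stated in the abstract

# Hypothetical stage-2 interface: (eeg_window, pair) -> P(pair[0] is attended)
BinaryDecoder = Callable[[object, Tuple[int, int]], float]


def top_two_directions(spatial_spectrum: Sequence[float]) -> Tuple[int, int]:
    """Stage 1 stand-in (peak picking, not the paper's model):
    return the indices of the two largest spectrum bins, in ascending order."""
    if len(spatial_spectrum) != N_DIRECTIONS:
        raise ValueError(
            f"expected {N_DIRECTIONS} bins, got {len(spatial_spectrum)}"
        )
    ranked = sorted(
        range(N_DIRECTIONS), key=lambda i: spatial_spectrum[i], reverse=True
    )
    first, second = ranked[0], ranked[1]
    return (first, second) if first < second else (second, first)


def pairwise_decode(
    spatial_spectrum: Sequence[float],
    eeg_window: object,
    binary_decoder: BinaryDecoder,
) -> int:
    """Reduce the 14-class problem to a binary one, then let the
    EEG-based decoder choose between the two remaining directions."""
    pair = top_two_directions(spatial_spectrum)
    p_first = binary_decoder(eeg_window, pair)
    return pair[0] if p_first >= 0.5 else pair[1]


if __name__ == "__main__":
    # Toy spectrum with two active directions (indices 3 and 10).
    spectrum = [0.0] * N_DIRECTIONS
    spectrum[3], spectrum[10] = 1.0, 0.9
    # Toy decoder that always favours the second direction of the pair.
    print(pairwise_decode(spectrum, None, lambda eeg, pair: 0.2))
```

Once stage 1 has identified the two active directions, a decoder that ignores the EEG can only guess between them, which is consistent with the 50% chance level quoted in the abstract.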




Source journal

IEEE Transactions on Neural Systems and Rehabilitation Engineering (Medicine / Engineering, Biomedical)

CiteScore: 8.60
Self-citation rate: 8.20%
Articles published: 479
Review time: 6-12 weeks

Journal scope: Rehabilitative and neural aspects of biomedical engineering, including functional electrical stimulation, acoustic dynamics, human performance measurement and analysis, nerve stimulation, electromyography, motor control and stimulation; and hardware and software applications for rehabilitation engineering and assistive devices.