Chen Ma, Yue Zhang, Yina Guo, Xin Liu, Hong Shangguan, Juan Wang, Luqing Zhao
Title: Fully end-to-end EEG to speech translation using multi-scale optimized dual generative adversarial network with cycle-consistency loss
Journal: Neurocomputing, vol. 616, Article 128916
Publication date: 2024-11-17
Publication type: Journal Article
DOI: 10.1016/j.neucom.2024.128916
URL: https://www.sciencedirect.com/science/article/pii/S0925231224016874
Impact factor: 5.5 (JCR Q1, Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Decoding auditory evoked electroencephalographic (EEG) signals to correlate them with speech acoustic features and to construct transitional signals between the two signal domains is a challenging and fascinating research topic. Brain–computer interface (BCI) technologies that incorporate auditory evoked potentials (AEPs) can not only leverage encoder–decoder architectures for signal decoding but also employ generative adversarial networks (GANs) to translate human neural activity to speech (T-HNAS). However, in previous research, the cascading ratio of transitional signals leads to varying degrees of information loss in the signals of both domains, and the optimal ratio of transitional signals differs across datasets, which reduces translation effectiveness. To address these issues, an improved dual generative adversarial network based on multi-scale optimization and cycle-consistency loss (MSCC-DualGAN) is proposed. We use the cycle-consistency loss, which facilitates cross-modal signal conversion, to replace transitional signals and to maintain the integrity of signals in both domains during loss computation. Multi-scale optimization refines the details of signals downsampled by the network and improves the similarity between features, enabling efficient, fully end-to-end EEG-to-speech translation. Furthermore, to validate the efficacy of this network, we construct a new EEG dataset and evaluate the network with metrics such as mel-cepstral distortion (MCD), Pearson correlation coefficient (PCC), and structural similarity index measure (SSIM). Experimental results demonstrate that the new network significantly outperforms previous methods on auditory stimulus datasets.
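Illustrative note: the abstract names a cycle-consistency loss and the MCD and PCC metrics without giving their formulas. The sketch below shows the commonly used forms in NumPy, as an aid to reading rather than the authors' implementation. The L1 round-trip form of the loss is the CycleGAN-style definition and is an assumption here; the paper's exact loss and its weighting may differ. The function names, the toy "generators" (simple invertible scalings), and the random arrays standing in for EEG and speech features are all hypothetical. SSIM is omitted for brevity.

```python
import numpy as np


def cycle_consistency_loss(x, y, g_xy, g_yx):
    """L1 round-trip error in both directions (CycleGAN-style form, assumed)."""
    x_rec = g_yx(g_xy(x))  # domain X -> domain Y -> back to X
    y_rec = g_xy(g_yx(y))  # domain Y -> domain X -> back to Y
    return float(np.mean(np.abs(x_rec - x)) + np.mean(np.abs(y_rec - y)))


def pearson_cc(a, b):
    """Pearson correlation coefficient between two flattened arrays."""
    return float(np.corrcoef(np.ravel(a), np.ravel(b))[0, 1])


def mel_cepstral_distortion(c_ref, c_est):
    """Mean MCD in dB over frames.

    c_ref, c_est: arrays of shape (frames, coefficients). The 0th (energy)
    coefficient is skipped, as is conventional.
    """
    diff = np.asarray(c_ref)[:, 1:] - np.asarray(c_est)[:, 1:]
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(10.0 / np.log(10.0) * np.mean(per_frame))


# Hypothetical stand-ins for the two generators: an invertible scaling pair,
# plus a "collapsed" reverse mapping that discards all information.
scale = np.array([2.0, 0.5, 4.0, 1.0])
to_speech = lambda x: x * scale
to_eeg = lambda y: y / scale
collapse = lambda y: np.zeros_like(y)

# Random arrays standing in for EEG and speech feature frames.
rng = np.random.default_rng(0)
eeg = rng.standard_normal((8, 4))
speech = rng.standard_normal((8, 4))

loss_invertible = cycle_consistency_loss(eeg, speech, to_speech, to_eeg)
loss_collapsed = cycle_consistency_loss(eeg, speech, to_speech, collapse)
pcc_linear = pearson_cc(speech, 3.0 * speech + 1.0)

# One frame, three coefficients; only coefficient 1 differs (by 1.0).
mcd_example = mel_cepstral_distortion(
    np.zeros((1, 3)), np.array([[5.0, 1.0, 0.0]])
)

print("cycle loss (invertible pair):", loss_invertible)
print("cycle loss (collapsed reverse map):", loss_collapsed)
print("PCC of a linear transform:", pcc_linear)
print("MCD example (dB):", mcd_example)
```

The invertible pair reconstructs both inputs, so its round-trip loss vanishes, while the collapsed mapping is penalized. This is the property that lets a cycle-consistency term stand in for an explicit transitional signal between the two domains.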
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.