EarSE

IF 3.6 · Q2 · Computer Science, Information Systems
Di Duan, Yongliang Chen, Weitao Xu, Tianxing Li
{"title":"EarSE","authors":"Di Duan, Yongliang Chen, Weitao Xu, Tianxing Li","doi":"10.1145/3631447","DOIUrl":null,"url":null,"abstract":"Speech enhancement is regarded as the key to the quality of digital communication and is gaining increasing attention in the research field of audio processing. In this paper, we present EarSE, the first robust, hands-free, multi-modal speech enhancement solution using commercial off-the-shelf headphones. The key idea of EarSE is a novel hardware setting---leveraging the form factor of headphones equipped with a boom microphone to establish a stable acoustic sensing field across the user's face. Furthermore, we designed a sensing methodology based on Frequency-Modulated Continuous-Wave, which is an ultrasonic modality sensitive to capture subtle facial articulatory gestures of users when speaking. Moreover, we design a fully attention-based deep neural network to self-adaptively solve the user diversity problem by introducing the Vision Transformer network. We enhance the collaboration between the speech and ultrasonic modalities using a multi-head attention mechanism and a Factorized Bilinear Pooling gate. Extensive experiments demonstrate that EarSE achieves remarkable performance as increasing SiSDR by 14.61 dB and reducing the word error rate of user speech recognition by 22.45--66.41% in real-world application. EarSE not only outperforms seven baselines by 38.0% in SiSNR, 12.4% in STOI, and 20.5% in PESQ on average but also maintains practicality.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"10 9","pages":"1 - 33"},"PeriodicalIF":3.6000,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3631447","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Speech enhancement is regarded as the key to the quality of digital communication and is gaining increasing attention in the research field of audio processing. In this paper, we present EarSE, the first robust, hands-free, multi-modal speech enhancement solution using commercial off-the-shelf headphones. The key idea of EarSE is a novel hardware setup: leveraging the form factor of headphones equipped with a boom microphone to establish a stable acoustic sensing field across the user's face. Furthermore, we design a sensing methodology based on Frequency-Modulated Continuous-Wave (FMCW) signals, an ultrasonic modality sensitive enough to capture the subtle facial articulatory gestures users make when speaking. Moreover, we design a fully attention-based deep neural network that self-adaptively addresses the user-diversity problem by introducing the Vision Transformer network. We enhance the collaboration between the speech and ultrasonic modalities using a multi-head attention mechanism and a Factorized Bilinear Pooling gate. Extensive experiments demonstrate that EarSE achieves remarkable performance, increasing SiSDR by 14.61 dB and reducing the word error rate of user speech recognition by 22.45%–66.41% in real-world applications. EarSE not only outperforms seven baselines by 38.0% in SiSNR, 12.4% in STOI, and 20.5% in PESQ on average but also maintains practicality.
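The abstract does not disclose EarSE's chirp parameters, but the FMCW principle it relies on is straightforward: transmit a linear frequency sweep, mix the echo with the transmitted signal, and the round-trip delay to the reflecting surface appears as a constant beat frequency. The sketch below illustrates this, assuming a near-ultrasonic 18–22 kHz sweep, a 40 ms sweep time, and a 48 kHz microphone sample rate; these numbers are illustrative guesses, not the paper's actual configuration.

```python
import numpy as np

# Illustrative FMCW dechirping sketch. All parameters are assumptions;
# the paper does not disclose EarSE's actual chirp design.
FS = 48_000              # microphone sample rate (Hz), assumed
F0, BW = 18_000, 4_000   # sweep start frequency and bandwidth (Hz), assumed
T = 0.040                # sweep duration (s), assumed
C = 343.0                # speed of sound (m/s)

t = np.arange(int(FS * T)) / FS
tx = np.cos(2 * np.pi * (F0 * t + 0.5 * (BW / T) * t ** 2))  # linear chirp

# Simulate an echo from a facial surface ~5 cm from the boom microphone.
delay = 2 * 0.05 / C                        # round-trip delay (s)
rx = 0.3 * np.roll(tx, int(round(delay * FS)))

# Dechirp: mixing rx with tx turns the delay into a beat tone at
# f_b = (BW / T) * delay, so the FFT peak position maps linearly to range.
beat = rx * tx
spectrum = np.abs(np.fft.rfft(beat * np.hanning(beat.size)))
freqs = np.fft.rfftfreq(beat.size, 1 / FS)
f_beat = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
range_est = f_beat * T * C / (2 * BW)
print(f"beat {f_beat:.1f} Hz -> range {100 * range_est:.1f} cm")
```

With these assumed parameters the range resolution is C / (2 · BW) ≈ 4.3 cm, which is why millimetre-scale articulatory motion is typically read from the phase of the beat signal across successive sweeps rather than from the FFT bin alone.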
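Likewise, the abstract names the fusion ingredients (multi-head attention plus a Factorized Bilinear Pooling gate) without the exact wiring. Below is a minimal PyTorch sketch of one plausible arrangement; the layer sizes, the sigmoid gating, and the cross-attention direction are all assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FBPGatedFusion(nn.Module):
    """Sketch: cross-modal fusion with multi-head attention and a
    Factorized Bilinear Pooling (FBP) gate. Sizes and wiring are assumed."""

    def __init__(self, dim=256, heads=4, factor_dim=512, k=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # FBP factors: z = sum-pool_k((U x) * (V y))
        self.U = nn.Linear(dim, factor_dim * k)
        self.V = nn.Linear(dim, factor_dim * k)
        self.k = k
        self.gate = nn.Linear(factor_dim, dim)

    def forward(self, speech, ultra):
        # speech, ultra: (batch, frames, dim), frame-aligned sequences.
        # Cross-attention: speech frames query the ultrasonic features.
        attended, _ = self.attn(query=speech, key=ultra, value=ultra)
        # Factorized bilinear pooling of the two modalities, per frame.
        z = self.U(speech) * self.V(attended)            # (B, T, k * factor_dim)
        z = z.view(*z.shape[:2], self.k, -1).sum(dim=2)  # sum-pool over k
        # Signed square-root and L2 normalization, standard for FBP.
        z = torch.sign(z) * torch.sqrt(z.abs() + 1e-6)
        z = nn.functional.normalize(z, dim=-1)
        # Use the pooled feature as a sigmoid gate on the speech stream.
        return speech * torch.sigmoid(self.gate(z))

# Usage: frame-aligned features of shape (batch, frames, 256) in and out.
fusion = FBPGatedFusion()
out = fusion(torch.randn(2, 100, 256), torch.randn(2, 100, 256))
```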
Source journal
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (Computer Science: Computer Networks and Communications)
CiteScore: 9.10 · Self-citation rate: 0.00% · Articles: 154