V-Speech

Impact Factor: 0.7 · JCR Quartile: Q4 (Telecommunications)
H. A. C. Maruri, P. López-Meyer, Jonathan Huang, W. Beltman, L. Nachman, Hong Lu
DOI: 10.1145/3427384.3427392
Journal: GetMobile: Mobile Computing & Communications Review, pp. 18–24
Published: 2020-09-29 (Journal Article)
Citations: 0

Abstract

Smart glasses are often used in noisy public spaces or industrial settings. Voice commands and automatic speech recognition (ASR) are good user interfaces for such a form factor, but background noise and interfering speakers pose significant challenges. Typical signal processing techniques have limitations in performance and/or hardware resources. V-Speech is a novel solution that captures the voice signal with a vibration sensor located in the nasal pads of smart glasses. Although the signal-to-noise ratio (SNR) is much higher with vibration sensor capture, it introduces a "nasal distortion," which must be dealt with. The second part of our proposed solution involves a voice transformation of the vibration signal using a neural network to produce an output that mimics the characteristics of a conventional microphone. We evaluated V-Speech in noise-free and very noisy conditions with 30 volunteer speakers uttering 145 phrases each, and validated its performance on ASR engines, with assessments of voice quality using the Perceptual Evaluation of Speech Quality (PESQ) metric, and with subjective listeners to determine intelligibility, naturalness, and overall quality. The results show, in extreme noise conditions, a mean improvement of 50% in Word Error Rate (WER) and of 1.0 on the 5.0-point PESQ scale, with speech regarded as intelligible and naturalness rated as fair to good. The output of V-Speech has low noise, sounds natural, and enables clear voice communication in challenging environments.
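The abstract reports ASR results as Word Error Rate (WER), the standard metric for this kind of evaluation: the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. The sketch below is not the paper's evaluation code, only a minimal illustration of how WER is conventionally computed.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions against a 4-word reference -> WER = 0.5
print(wer("turn on the lights", "turn off the light"))
```

A "mean improvement of 50% for WER" in this framing means the error rate roughly halves, e.g. a baseline WER of 0.40 in extreme noise dropping to about 0.20 with the vibration-sensor pipeline.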