Spectral–temporal saliency masks and modulation tensorgrams for generalizable COVID-19 detection

IF 3.1 · Zone 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2024-02-01 DOI:10.1016/j.csl.2024.101620

Yi Zhu, Tiago H. Falk

Volume 86, Article 101620. Publication type: Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S0885230824000032

Abstract

Speech-based COVID-19 detection systems have gained popularity because they offer an easy-to-use, low-cost solution that is well suited to long-term at-home monitoring of patients with persistent symptoms. Recently, however, the limited ability of existing deep neural network based systems to generalize to unseen datasets has been raised as a serious concern, as has their limited interpretability. In this study, we aim to develop an interpretable and generalizable speech-based COVID-19 detection system. First, we propose the use of a 3-dimensional modulation frequency tensor (called the modulation tensorgram representation, MTR) as input to a convolutional recurrent neural network for COVID-19 detection. The MTR is known to capture long-term dynamics of speech correlated with articulation and respiration, making it a potential candidate for characterizing COVID-19 speech. The customized network explores both the spectral and temporal patterns in the MTR to learn the underlying COVID-19 speech pattern. Next, we design a spectro-temporal saliency mask that aggregates the regions of the MTR related to COVID-19, further improving the generalizability and interpretability of the model. Experiments are conducted on three public datasets, and the results show that the proposed solution consistently outperforms two benchmark systems in within-, across-, and unseen-dataset tests. The learned salient regions are shown to be correlated with whispered speech and vocal hoarseness, which explains the increased generalizability. Furthermore, our model relies on a small number of parameters, thus offering a promising solution for on-device remote monitoring of COVID-19 infection.
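To make the idea of a 3-dimensional modulation frequency tensor more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one common construction: a magnitude spectrogram is cut into overlapping blocks along time, and a second FFT along the time axis of each block yields modulation frequency. The function names, frame and block sizes, and the all-ones mask are hypothetical placeholders; the abstract does not specify any of them, and in the paper the saliency mask is learned rather than fixed by hand.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=64):
    """Magnitude STFT with a Hann window, shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def modulation_tensorgram(signal, frame_len=256, hop=64, block_len=32, block_hop=16):
    """Stack per-block modulation spectra into a 3-D tensor.

    Returns an array of shape (n_blocks, n_acoustic_bins, n_modulation_bins).
    All sizes are illustrative assumptions, not values from the paper.
    """
    spec = spectrogram(signal, frame_len, hop)  # (time frames, acoustic bins)
    n_blocks = 1 + (spec.shape[0] - block_len) // block_hop
    window = np.hanning(block_len)[:, None]
    blocks = []
    for b in range(n_blocks):
        segment = spec[b * block_hop : b * block_hop + block_len] * window
        # Second FFT along the time axis: one modulation spectrum per acoustic bin.
        modulation = np.abs(np.fft.rfft(segment, axis=0))  # (mod bins, acoustic bins)
        blocks.append(modulation.T)
    return np.stack(blocks)


def apply_saliency_mask(tensorgram, mask):
    """Weight each (acoustic, modulation) bin by a mask shared across all blocks."""
    return tensorgram * mask[None, :, :]


# Synthetic test signal: a 1 kHz tone with slow amplitude modulation, 1 s at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
audio = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)

mtr = modulation_tensorgram(audio)
mask = np.ones(mtr.shape[1:])  # placeholder mask that keeps every region
masked = apply_saliency_mask(mtr, mask)
print(mtr.shape)
```

The resulting tensor has the axes (time block, acoustic frequency, modulation frequency), which is the kind of input a convolutional recurrent network can consume: convolutions over the two frequency axes and recurrence over the block axis. A mask with values below 1 in some bins would attenuate those spectral regions in every block.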

Source journal

Computer Speech and Language (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore

11.30

Self-citation rate

4.70%

Articles published

Review time

22.9 weeks

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.