Contrasting deep learning audio models for direct respiratory insufficiency detection versus blood oxygen saturation estimation

Intelligence-Based Medicine. Pub Date: 2026-03-01. Epub Date: 2025-12-12. DOI: 10.1016/j.ibmed.2025.100331

Marcelo Matheus Gauy, Natália Hitomi Koza, Ricardo Mikio Morita, Gabriel Rocha Stanzione, Arnaldo Cândido Júnior, Larissa Cristina Berti, Anna Sara Shafferman Levin, Ester Cerdeira Sabino, Flaviane Romani Fernandes Svartman, Marcelo Finger

Volume 13, Article 100331. Full text: https://www.sciencedirect.com/science/article/pii/S266652122500136X

Abstract

This work investigates the strengths and limitations of non-invasive, audio-based deep learning methods for the detection of respiratory conditions. We contrast performance on expert-centered respiratory insufficiency (RI) detection with performance on estimation of blood oxygen saturation (SpO2), a quantity that is easily measured. Several deep learning audio models have recently been proposed for RI detection via voice and speech analysis; these models have obtained an accuracy of 95% in general patients and 97.4% in COVID-19 patients. Here, we extend those results by fine-tuning several Pretrained Audio Neural Networks (CNN6, CNN10 and CNN14) and Masked Autoencoders (Audio-MAE) for RI detection, showing that some of these models achieve near-perfect accuracy (99.9% on COVID RI and 98.6% on general RI). The models were pretrained on AudioSet, which improved performance, with transfer learning playing a key role in preventing overfitting. The near-perfect RI detection performance suggests that low-cost, automated methods could be developed to assist patient triage. In parallel, this paper seeks to verify the feasibility of SpO2 estimation, so we perform a binary classification at a 92% SpO2 threshold using the same architectures. In contrast to our findings for RI, these models yielded an accuracy below 70% and a Matthews correlation coefficient (MCC) below 0.3, indicating both that SpO2 estimation from audio alone is infeasible and that the audio recordings contain multiple features that are useful for RI detection but not for SpO2 estimation. We propose that this discrepancy demonstrates the limits of voice and speech biomarkers across different diagnostic tasks under current technologies.
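
The two figures reported for the SpO2 task, accuracy and MCC, can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes that readings below the 92% threshold form the positive class (the abstract does not state the direction or whether the boundary is inclusive), and the SpO2 readings and predictions are made up for illustration. MCC is computed from the confusion counts as (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

```python
import math

SPO2_THRESHOLD = 92.0  # percent; the threshold named in the abstract


def label_from_spo2(spo2_percent, threshold=SPO2_THRESHOLD):
    """Binarize an SpO2 reading.

    Assumption: readings strictly below the threshold are the positive
    class (1). The abstract does not specify this convention.
    """
    return 1 if spo2_percent < threshold else 0


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical data, not taken from the paper.
spo2_readings = [97.0, 95.0, 91.0, 88.0, 93.0, 90.0, 96.0, 89.0]
y_true = [label_from_spo2(s) for s in spo2_readings]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]  # made-up classifier outputs

acc = accuracy(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
print(f"accuracy={acc:.2f}, MCC={mcc:.2f}")  # accuracy=0.75, MCC=0.50
```

MCC is informative alongside accuracy because it uses all four confusion counts: a classifier that always predicts the majority class can reach a moderate accuracy on imbalanced data while its MCC stays at 0.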
