Contrasting deep learning audio models for direct respiratory insufficiency detection versus blood oxygen saturation estimation

Marcelo Matheus Gauy, Natália Hitomi Koza, Ricardo Mikio Morita, Gabriel Rocha Stanzione, Arnaldo Cândido Júnior, Larissa Cristina Berti, Anna Sara Shafferman Levin, Ester Cerdeira Sabino, Flaviane Romani Fernandes Svartman, Marcelo Finger

Intelligence-Based Medicine, volume 13, Article 100331. Published online 2025-12-12; issue date 2026-03-01.
DOI: 10.1016/j.ibmed.2025.100331
https://www.sciencedirect.com/science/article/pii/S266652122500136X
Abstract
This work investigates the strengths and limitations of non-invasive, audio-based deep learning methods for detecting respiratory conditions. We contrast the performance obtained on two tasks: the expert-centered task of respiratory insufficiency (RI) detection and the estimation of easily measured blood oxygen saturation (SpO2). Several deep learning audio models have recently been proposed for RI detection via voice and speech analysis; these models obtained an accuracy of 95% in general patients and 97.4% in COVID-19 patients. Here, we extend those results by refining several pretrained audio neural networks (CNN6, CNN10 and CNN14) and Masked Autoencoders (Audio-MAE) for RI detection, showing that some of these models achieve near-perfect accuracy (99.9% on COVID RI and 98.6% on general RI). Pretraining the models on AudioSet improved performance, with transfer learning playing a key role in preventing overfitting. The near-perfect RI detection performance suggests that low-cost, automated methods could be developed to assist patient triage. In parallel, to verify the feasibility of SpO2 estimation, we perform a binary classification at a 92% SpO2 threshold using the same architectures. In contrast to our findings for RI, these models yielded an accuracy below 70% and a Matthews correlation coefficient (MCC) below 0.3, indicating both that estimating SpO2 from audio alone is infeasible and that the audio recordings contain multiple features that are useful for RI detection but not for SpO2 estimation. We propose that this discrepancy demonstrates the limits of voice and speech biomarkers across different diagnostic tasks under current technologies.
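To make the two reported SpO2 metrics concrete, the following minimal Python sketch shows how readings could be binarized at a 92% threshold and how accuracy and MCC are then computed from the confusion counts. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the convention that readings below 92% form the positive class, and all numbers in the example are hypothetical and do not come from the paper.

```python
import math


def binarize_spo2(spo2_values, threshold=92.0):
    """Label a reading 1 when SpO2 is below the threshold, else 0.

    Treating "below threshold" as the positive class is an assumption
    made for this sketch; the abstract does not state the convention.
    """
    return [1 if value < threshold else 0 for value in spo2_values]


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for two equal-length lists of 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical oximeter readings (percent); NOT data from the paper.
spo2_readings = [98, 97, 96, 95, 94, 93, 91, 90, 89, 88]
labels = binarize_spo2(spo2_readings)

# Hypothetical classifier outputs for the same ten recordings.
predictions = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

print("labels:  ", labels)
print("accuracy:", accuracy(labels, predictions))
print("MCC:     ", round(mcc(labels, predictions), 3))
```

In this toy example, six of the ten predictions are correct, yet the MCC is only about 0.17. MCC uses all four confusion counts and is 0 for chance-level agreement, which is why a low MCC alongside moderate accuracy points to weak predictive signal.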