Identifying bias in models that detect vocal fold paralysis from audio recordings using explainable machine learning and clinician ratings.

Daniel M Low, Vishwanatha Rao, Gregory Randolph, Phillip C Song, Satrajit S Ghosh
medRxiv: the preprint server for health sciences, 2024-03-20.
DOI: 10.1101/2020.11.23.20235945
Open-access PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/9e/ea/nihpp-2020.11.23.20235945v6.PMC7836138.pdf

Abstract

Introduction: Detecting voice disorders from voice recordings could allow for frequent, remote, and low-cost screening before costly clinical visits and a more invasive laryngoscopy examination. Our goals were to detect unilateral vocal fold paralysis (UVFP) from voice recordings using machine learning, to identify which acoustic variables were important for prediction to increase trust, and to determine model performance relative to clinician performance.

Methods: Patients with UVFP confirmed through endoscopic examination (N=77) and controls with normal voices matched for age and sex (N=77) were included. Voice samples were elicited by reading the Rainbow Passage and sustaining phonation of the vowel "a". Four machine learning models of differing complexity were used. SHapley Additive exPlanations (SHAP) were used to identify important features.
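The paper uses SHAP to rank acoustic features by their contribution to the prediction. As a dependency-light stand-in for the same idea, the sketch below ranks features with scikit-learn's permutation importance on synthetic data; the feature names, effect sizes, and model choice are illustrative assumptions, not the paper's actual feature set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 154  # 77 patients + 77 controls, as in the study
# Hypothetical acoustic features: only "jitter" and "hnr" carry signal here.
feature_names = ["f0_mean", "jitter", "shimmer", "hnr"]
y = np.r_[np.zeros(77), np.ones(77)]
X = rng.normal(size=(n, 4))
X[y == 1, 1] += 1.0   # jitter shifted higher in (synthetic) patients
X[y == 1, 3] -= 1.0   # HNR shifted lower in (synthetic) patients

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Permutation importance: accuracy drop when each feature is shuffled.
imp = permutation_importance(clf, X_te, y_te, n_repeats=20, random_state=0)
ranked = sorted(zip(feature_names, imp.importances_mean), key=lambda t: -t[1])
for name, score in ranked:
    print(f"{name:10s} {score:+.3f}")
```

With SHAP proper, one would instead build a `shap.Explainer` over the fitted model and inspect per-sample attributions; permutation importance gives only a global ranking.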

Results: The highest median bootstrapped ROC AUC score was 0.87, exceeding clinicians' performance on the same recordings (range: 0.74-0.81). Recording durations differed between UVFP recordings and controls because of how the data were originally processed and stored, and we show that duration alone can classify the two groups. Counterintuitively, many UVFP recordings also had higher intensity than controls, even though UVFP patients tend to have weaker voices, revealing a dataset-specific bias that we mitigate in an additional analysis.
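A median bootstrapped ROC AUC like the one reported above can be computed along these lines (a minimal sketch on synthetic scores; the resampling count and percentile interval are assumptions, not the paper's exact protocol):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(y_true, y_score, n_boot=1000, seed=0):
    """Resample (label, score) pairs with replacement and collect AUCs."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.median(aucs), np.percentile(aucs, [2.5, 97.5])

# Synthetic example: patients (1) score higher than controls (0) on average.
rng = np.random.default_rng(42)
y = np.r_[np.zeros(77, int), np.ones(77, int)]   # 77 controls, 77 patients
scores = np.r_[rng.normal(0.4, 0.2, 77), rng.normal(0.7, 0.2, 77)]
median_auc, ci = bootstrap_auc(y, scores)
print(f"median AUC {median_auc:.2f}, 95% CI {ci[0]:.2f}-{ci[1]:.2f}")
```

Reporting the median with a percentile interval, rather than a single AUC, conveys how stable the estimate is under resampling of the test set.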

Conclusion: We demonstrate that recording biases in audio duration and intensity created dataset-specific differences between patients and controls, which the models exploited to improve classification. Furthermore, clinicians' ratings provide additional evidence that patients were over-projecting their voices and were recorded at a higher amplitude than controls. Interestingly, after matching audio duration and removing variables associated with intensity to mitigate these biases, the models still achieved similarly high performance. We provide a set of recommendations to avoid bias when building and evaluating machine learning models for screening in laryngology.
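One quick check for the duration confound described above is to ask whether a trivial model can separate patients from controls using recording duration alone; near-chance AUC after matching suggests the confound is mitigated. The sketch below uses synthetic durations, not the paper's data, and the specific means and spreads are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical durations (seconds): patient clips stored slightly longer on average.
dur_controls = rng.normal(4.0, 0.5, 77)
dur_patients = rng.normal(5.0, 0.5, 77)
X = np.r_[dur_controls, dur_patients].reshape(-1, 1)
y = np.r_[np.zeros(77), np.ones(77)]

# Duration-only classifier: AUC well above 0.5 means duration leaks the label.
auc = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="roc_auc").mean()
print(f"duration-only AUC: {auc:.2f}")

# After matching durations across groups, the same check should fall toward chance.
X_matched = rng.normal(4.5, 0.5, 154).reshape(-1, 1)
auc_matched = cross_val_score(LogisticRegression(), X_matched, y,
                              cv=5, scoring="roc_auc").mean()
print(f"matched-duration AUC: {auc_matched:.2f}")
```

The same single-feature probe applies to any suspected confound (e.g. intensity): if a variable that should be clinically uninformative classifies the groups, the dataset, not the pathology, is driving performance.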
