Omicron detection with large language models and YouTube audio data.

James T Anibal, Adam J Landa, Nguyen T T Hang, Miranda J Song, Alec K Peltekian, Ashley Shin, Hannah B Huth, Lindsey A Hazen, Anna S Christou, Jocelyne Rivera, Robert A Morhard, Ulas Bagci, Ming Li, Yael Bensoussan, David A Clifton, Bradford J Wood
medRxiv: the preprint server for health sciences. DOI: 10.1101/2022.09.13.22279673. Published 2024-03-27. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9516853/pdf/
Citations: 0

Abstract

Publicly available audio data presents a unique opportunity for the development of digital health technologies with large language models (LLMs). In this study, YouTube was mined to collect audio data from individuals with self-declared positive COVID-19 tests, as well as from those with other upper respiratory infections (URI) and healthy subjects discussing a diverse range of topics. The resulting dataset was transcribed with the Whisper model and used to assess the capacity of LLMs to detect self-reported COVID-19 cases and perform variant classification. Following prompt optimization, LLMs achieved accuracies of 0.89 and 0.97, respectively, in the tasks of identifying self-reported COVID-19 cases and other respiratory illnesses. The model also obtained a mean accuracy of 0.77 at identifying the variant of self-reported COVID-19 cases using only symptoms and other health-related factors described in the YouTube videos. In comparison with past studies, which used scripted, standardized voice samples to capture biomarkers, this study focused on extracting meaningful information from public online audio data. This work introduced novel design paradigms for pandemic management tools, showing the potential of audio data in clinical and public health applications.
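The pipeline described above (Whisper transcription followed by prompted LLM classification) can be sketched roughly as follows. The prompt wording, the `build_prompt`/`parse_label` helper names, and the one-word YES/NO answer format are illustrative assumptions, not the study's actual optimized prompts:

```python
def build_prompt(transcript: str) -> str:
    """Assemble a zero-shot classification prompt for an LLM.

    The question wording and answer format are illustrative
    assumptions; the study's optimized prompts are not given
    in the abstract.
    """
    return (
        "Below is a transcript of a YouTube video.\n"
        "Does the speaker self-report a positive COVID-19 test?\n"
        "Answer with exactly one word: YES or NO.\n\n"
        f"Transcript:\n{transcript}"
    )


def parse_label(llm_response: str) -> bool:
    """Map the model's free-text reply onto a binary label."""
    return llm_response.strip().upper().startswith("YES")


# Hypothetical usage with the openai-whisper Python API:
#   import whisper
#   text = whisper.load_model("base").transcribe("clip.mp3")["text"]
#   prompt = build_prompt(text)  # send `prompt` to an LLM of choice,
#   # then pass the reply through parse_label()
```

The same template extends naturally to the variant-classification task by asking for a variant name instead of a binary answer.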


Digital Omicron detection using unscripted voice samples from social media.

The success of artificial intelligence in clinical settings depends on the diversity and availability of training data. In some cases, social media data can be used to counterbalance the limited amount of accessible, well-curated clinical data, but this possibility remains largely unexplored. In this study, we mined YouTube to collect voice data from individuals with self-declared positive COVID-19 tests during the period when Omicron was the dominant variant [1,2,3], along with samples from non-Omicron COVID-19 variants, other upper respiratory infections (URI), and healthy subjects. The resulting dataset was used to train a DenseNet model to detect the Omicron variant from changes in the voice. Our model achieved 0.85/0.80 specificity/sensitivity in separating Omicron samples from healthy samples, and 0.76/0.70 specificity/sensitivity in separating Omicron samples from symptomatic non-COVID samples. Compared with past studies that used scripted voice samples, we found that leveraging the intra-sample variance inherent to unscripted speech enhanced generalization. Our work introduces new design paradigms for audio-based diagnostic tools and establishes the potential of social media data for training digital diagnostic models suitable for real-world deployment.
(This section was machine-translated; in case of discrepancies, the English original prevails.)
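The specificity/sensitivity figures reported above follow from a standard confusion-matrix calculation over binary predictions. The helper below is a generic sketch; the function name and the 1 = Omicron-positive / 0 = control label convention are assumptions, not code from the study:

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true-positive rate) and specificity (true-negative
    rate) for binary labels, with 1 = Omicron-positive and 0 = control."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)
```

For example, a classifier that catches 3 of 4 true positives while raising 1 false alarm on 5 controls scores 0.75 sensitivity and 0.80 specificity.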