Prosanta Barai, Gondy Leroy, Prakash Bisht, Joshua M Rothman, Sumi Lee, Jennifer Andrews, Sydney A Rice, Arif Ahmed
{"title":"增强数据质量保证的众包:缓解医疗保健大型语言模型训练中资源稀缺挑战的有效方法。","authors":"Prosanta Barai, Gondy Leroy, Prakash Bisht, Joshua M Rothman, Sumi Lee, Jennifer Andrews, Sydney A Rice, Arif Ahmed","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>Large Language Models (LLMs) have demonstrated immense potential in artificial intelligence across various domains, including healthcare. However, their efficacy is hindered by the need for high-quality labeled data, which is often expensive and time-consuming to create, particularly in low-resource domains like healthcare. To address these challenges, we propose a crowdsourcing (CS) framework enriched with quality control measures at the pre-, real-time-, and post-data gathering stages. Our study evaluated the effectiveness of enhancing data quality through its impact on LLMs (Bio-BERT) for predicting autism-related symptoms. The results show that real-time quality control improves data quality by 19% compared to pre-quality control. Fine-tuning Bio-BERT using crowdsourced data generally increased recall compared to the Bio-BERT baseline but lowered precision. Our findings highlighted the potential of crowdsourcing and quality control in resource-constrained environments and offered insights into optimizing healthcare LLMs for informed decision-making and improved patient care.</p>","PeriodicalId":72181,"journal":{"name":"AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science","volume":"2024 ","pages":"75-84"},"PeriodicalIF":0.0000,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11141838/pdf/","citationCount":"0","resultStr":"{\"title\":\"Crowdsourcing with Enhanced Data Quality Assurance: An Efficient Approach to Mitigate Resource Scarcity Challenges in Training Large Language Models for Healthcare.\",\"authors\":\"Prosanta Barai, Gondy Leroy, Prakash Bisht, Joshua M Rothman, Sumi Lee, Jennifer Andrews, Sydney A Rice, Arif Ahmed\",\"doi\":\"\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Large Language Models (LLMs) have demonstrated immense potential in artificial intelligence across various domains, including healthcare. However, their efficacy is hindered by the need for high-quality labeled data, which is often expensive and time-consuming to create, particularly in low-resource domains like healthcare. To address these challenges, we propose a crowdsourcing (CS) framework enriched with quality control measures at the pre-, real-time-, and post-data gathering stages. Our study evaluated the effectiveness of enhancing data quality through its impact on LLMs (Bio-BERT) for predicting autism-related symptoms. The results show that real-time quality control improves data quality by 19% compared to pre-quality control. Fine-tuning Bio-BERT using crowdsourced data generally increased recall compared to the Bio-BERT baseline but lowered precision. Our findings highlighted the potential of crowdsourcing and quality control in resource-constrained environments and offered insights into optimizing healthcare LLMs for informed decision-making and improved patient care.</p>\",\"PeriodicalId\":72181,\"journal\":{\"name\":\"AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science\",\"volume\":\"2024 \",\"pages\":\"75-84\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-05-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11141838/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
Crowdsourcing with Enhanced Data Quality Assurance: An Efficient Approach to Mitigate Resource Scarcity Challenges in Training Large Language Models for Healthcare.
Large Language Models (LLMs) have demonstrated immense potential in artificial intelligence across various domains, including healthcare. However, their efficacy is hindered by the need for high-quality labeled data, which is often expensive and time-consuming to create, particularly in low-resource domains like healthcare. To address these challenges, we propose a crowdsourcing (CS) framework enriched with quality control measures at the pre-, real-time-, and post-data gathering stages. Our study evaluated the effectiveness of enhancing data quality through its impact on LLMs (Bio-BERT) for predicting autism-related symptoms. The results show that real-time quality control improves data quality by 19% compared to pre-quality control. Fine-tuning Bio-BERT using crowdsourced data generally increased recall compared to the Bio-BERT baseline but lowered precision. Our findings highlighted the potential of crowdsourcing and quality control in resource-constrained environments and offered insights into optimizing healthcare LLMs for informed decision-making and improved patient care.