{"title":"用于分析全球卫生调查开放文本的大型语言模型:刚果民主共和国儿童无法获得疫苗服务的原因","authors":"Roy Burstein, Eric Mafuta, Joshua L Proctor","doi":"10.1093/inthealth/ihaf015","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>This study evaluates the use of large language models (LLMs) to analyze free-text responses from large-scale global health surveys, using data from the Enquête de Couverture Vaccinale (ECV) household coverage surveys from 2020, 2021, 2022 and 2023 as a case study.</p><p><strong>Methods: </strong>We tested several LLM approaches consisting of zero-shot and few-shot prompting, fine-tuning, and a natural language processing approach using semantic embeddings, to analyze responses on the reasons caregivers did not vaccinate their children.</p><p><strong>Results: </strong>Performance ranged from 61.5% to 96% based on testing against a curated benchmarking dataset drawn from the ECV surveys, with accuracy improving when LLMs were fine-tuned or provided examples for few-shot learning. We show that even with as few as 20-100 examples, LLMs can achieve high accuracy in categorizing free-text responses.</p><p><strong>Conclusions: </strong>This approach offers significant opportunities for reanalyzing existing datasets and designing surveys with more open-ended questions, providing a scalable, cost-effective solution for global health organizations. Despite challenges with closed-source models and computational costs, the study underscores LLMs' potential to enhance data analysis and inform global health policy.</p>","PeriodicalId":49060,"journal":{"name":"International Health","volume":" ","pages":"843-852"},"PeriodicalIF":2.2000,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12406778/pdf/","citationCount":"0","resultStr":"{\"title\":\"Large language models for analyzing open text in global health surveys: why children are not accessing vaccine services in the Democratic Republic of the Congo.\",\"authors\":\"Roy Burstein, Eric Mafuta, Joshua L Proctor\",\"doi\":\"10.1093/inthealth/ihaf015\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>This study evaluates the use of large language models (LLMs) to analyze free-text responses from large-scale global health surveys, using data from the Enquête de Couverture Vaccinale (ECV) household coverage surveys from 2020, 2021, 2022 and 2023 as a case study.</p><p><strong>Methods: </strong>We tested several LLM approaches consisting of zero-shot and few-shot prompting, fine-tuning, and a natural language processing approach using semantic embeddings, to analyze responses on the reasons caregivers did not vaccinate their children.</p><p><strong>Results: </strong>Performance ranged from 61.5% to 96% based on testing against a curated benchmarking dataset drawn from the ECV surveys, with accuracy improving when LLMs were fine-tuned or provided examples for few-shot learning. We show that even with as few as 20-100 examples, LLMs can achieve high accuracy in categorizing free-text responses.</p><p><strong>Conclusions: </strong>This approach offers significant opportunities for reanalyzing existing datasets and designing surveys with more open-ended questions, providing a scalable, cost-effective solution for global health organizations. 
Despite challenges with closed-source models and computational costs, the study underscores LLMs' potential to enhance data analysis and inform global health policy.</p>\",\"PeriodicalId\":49060,\"journal\":{\"name\":\"International Health\",\"volume\":\" \",\"pages\":\"843-852\"},\"PeriodicalIF\":2.2000,\"publicationDate\":\"2025-09-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12406778/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Health\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1093/inthealth/ihaf015\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Health","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1093/inthealth/ihaf015","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PUBLIC, ENVIRONMENTAL & OCCUPATIONAL HEALTH","Score":null,"Total":0}
Citations: 0
Abstract
Background: This study evaluates the use of large language models (LLMs) to analyze free-text responses from large-scale global health surveys, using data from the Enquête de Couverture Vaccinale (ECV) household coverage surveys from 2020, 2021, 2022 and 2023 as a case study.
Methods: We tested several approaches, including zero-shot and few-shot LLM prompting, LLM fine-tuning, and a natural language processing method based on semantic embeddings, to analyze responses about the reasons caregivers did not vaccinate their children.
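For readers unfamiliar with the prompting workflow, the following is a minimal sketch of how few-shot classification of free-text survey responses can be set up. It is illustrative only: the category labels, example responses, model name and choice of the OpenAI Python client are assumptions, not the authors' actual prompts, taxonomy or tooling.

```python
# Hypothetical few-shot classification of a free-text survey response.
# Categories, examples and model name are placeholders, not from the ECV data.
from openai import OpenAI

client = OpenAI()  # assumes an API key is available in the environment

CATEGORIES = [
    "vaccine stock-out", "distance to facility", "caregiver too busy",
    "fear of side effects", "lack of information", "other",
]

FEW_SHOT_EXAMPLES = [
    ("The health centre had no doses left when we went.", "vaccine stock-out"),
    ("The clinic is too far and transport is expensive.", "distance to facility"),
]

def classify_response(free_text: str) -> str:
    """Ask the LLM to map one open-text response to exactly one category."""
    examples = "\n".join(f'Response: "{t}"\nCategory: {c}' for t, c in FEW_SHOT_EXAMPLES)
    prompt = (
        "Classify the caregiver's reason for not vaccinating their child into "
        f"exactly one of these categories: {', '.join(CATEGORIES)}.\n\n"
        f'{examples}\n\nResponse: "{free_text}"\nCategory:'
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip()

print(classify_response("Il n'y avait pas de vaccins au centre de santé."))
```

Dropping the FEW_SHOT_EXAMPLES block from the prompt turns the same function into a zero-shot classifier.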
Results: Performance ranged from 61.5% to 96% when tested against a curated benchmark dataset drawn from the ECV surveys, with accuracy improving when LLMs were fine-tuned or given examples for few-shot learning. We show that even with as few as 20-100 examples, LLMs can achieve high accuracy in categorizing free-text responses.
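The semantic-embedding baseline mentioned in the Methods likewise needs only a small labelled set: each new response is assigned the label of its most similar labelled example. The sketch below shows one simple way to do this; the library, model name, labels and example texts are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical nearest-neighbour classifier over sentence embeddings.
# Labels and examples are placeholders, not taken from the ECV surveys.
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder multilingual model; ECV responses would likely need multilingual coverage.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

labelled = [
    ("No vaccines were available at the health centre.", "vaccine stock-out"),
    ("The clinic is too far from our village.", "distance to facility"),
    ("I was afraid my child would get a fever.", "fear of side effects"),
]
texts, labels = zip(*labelled)
ref_vecs = model.encode(list(texts), normalize_embeddings=True)

def classify(free_text: str) -> str:
    """Return the label of the most semantically similar labelled example."""
    vec = model.encode([free_text], normalize_embeddings=True)
    sims = ref_vecs @ vec[0]  # cosine similarity, since vectors are normalized
    return labels[int(np.argmax(sims))]

print(classify("Le centre de santé n'avait plus de doses."))
```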
Conclusions: This approach offers significant opportunities for reanalyzing existing datasets and designing surveys with more open-ended questions, providing a scalable, cost-effective solution for global health organizations. Despite challenges with closed-source models and computational costs, the study underscores LLMs' potential to enhance data analysis and inform global health policy.
Journal introduction:
International Health is an official journal of the Royal Society of Tropical Medicine and Hygiene. It publishes original, peer-reviewed articles and reviews on all aspects of global health including the social and economic aspects of communicable and non-communicable diseases, health systems research, policy and implementation, and the evaluation of disease control programmes and healthcare delivery solutions.
It aims to stimulate scientific and policy debate and provide a forum for analysis and opinion sharing for individuals and organisations engaged in all areas of global health.