Large language models for deductive qualitative content analysis in dementia-focused embedded pragmatic clinical trials: A comparative methodological study.
Jeffrey Turner, Spencer Phillips Hey, Zachary G Baker, Vincent Mor, Jennifer L Sullivan
{"title":"Large language models for deductive qualitative content analysis in dementia-focused embedded pragmatic clinical trials: A comparative methodological study.","authors":"Jeffrey Turner, Spencer Phillips Hey, Zachary G Baker, Vincent Mor, Jennifer L Sullivan","doi":"10.1186/s43058-026-00953-8","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Thematic coding helps researchers characterize intervention implementation in embedded pragmatic clinical trials (ePCTs), particularly interventions for older adults with dementia and care partners. However, manual coding is time-consuming, requiring multiple researchers. Because implementation science relies on systematic identification of determinants, barriers, and facilitators, advances in Artificial Intelligence (AI), specifically large language models (LLMs), may automate this process meaningfully by accelerating implementation evaluations within ePCTs. We developed and tested an automated algorithm using Chat GPT-4o and Chat GPT-4o-mini to achieve human-level performance coding interview transcripts.</p><p><strong>Methods: </strong>We created a Python-based system that uses LLMs to process and code semi-structured interview transcripts about implementation challenges in translating dementia interventions into healthcare systems. The system matches excerpts to an existing codebook. Multiple iterations, including expert review, were used to refine accuracy and efficiency.</p><p><strong>Results: </strong>The LLM consistently coded more excerpts than humans. In the third iteration (V3), the LLM captured 61.7% of human-coded excerpts, with matching rates reaching as high as 72.6% for individual transcripts. Matching was higher for descriptive codes, 63.7%, than interpretive codes, 57.7%. The LLM identified 206 correct coded excerpts that human coders missed. In the fourth iteration (V4), GPT-4o outperformed GPT-4o-mini: descriptive code matching reached 89% (e.g. 
\"Site Characteristics\"), compared to 69% for GPT-4o-mini with the R1+R2 85% threshold. GPT-4o showed a weak, but positive correlation (r = 0.230) between transcript word count and matching agreement, while 4o-mini showed a moderate, but negative correlation (r = -0.452). The LLM workflow yielded a 97% reduction in time and a 99% reduction in cost per transcript.</p><p><strong>Conclusion: </strong>This study compared an LLM-powered workflow with human coding for thematic analysis. The LLM aligned strongly with human coders. While error rates necessitate human oversight, time and cost reduction, and ability to identify missed excerpts make it a potentially reliable supplementary tool. Although ePCTs and implementation science share complementary goals, they differ in focus, this flexible approach enhances efficiency and scalability, with acceptable accuracy. It streamlines the qualitative research workflow from outlining to the analysis of implementation processes in real-world settings and may accelerate existing implementation approaches while minimizing implementation resources.</p>","PeriodicalId":73355,"journal":{"name":"Implementation science communications","volume":" ","pages":""},"PeriodicalIF":3.3000,"publicationDate":"2026-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Implementation science communications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s43058-026-00953-8","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Introduction: Thematic coding helps researchers characterize intervention implementation in embedded pragmatic clinical trials (ePCTs), particularly interventions for older adults with dementia and their care partners. However, manual coding is time-consuming and requires multiple researchers. Because implementation science relies on systematic identification of determinants, barriers, and facilitators, advances in Artificial Intelligence (AI), specifically large language models (LLMs), may meaningfully automate this process and accelerate implementation evaluations within ePCTs. We developed and tested an automated algorithm using GPT-4o and GPT-4o-mini to achieve human-level performance in coding interview transcripts.
Methods: We created a Python-based system that uses LLMs to process and code semi-structured interview transcripts about implementation challenges in translating dementia interventions into healthcare systems. The system matches excerpts to an existing codebook. Multiple iterations, including expert review, were used to refine accuracy and efficiency.
Results: The LLM consistently coded more excerpts than humans. In the third iteration (V3), the LLM captured 61.7% of human-coded excerpts, with matching rates reaching as high as 72.6% for individual transcripts. Matching was higher for descriptive codes (63.7%) than for interpretive codes (57.7%). The LLM identified 206 correctly coded excerpts that human coders missed. In the fourth iteration (V4), GPT-4o outperformed GPT-4o-mini: descriptive code matching reached 89% (e.g., "Site Characteristics"), compared to 69% for GPT-4o-mini at the R1+R2 85% threshold. GPT-4o showed a weak positive correlation (r = 0.230) between transcript word count and matching agreement, while GPT-4o-mini showed a moderate negative correlation (r = -0.452). The LLM workflow yielded a 97% reduction in time and a 99% reduction in cost per transcript.
Conclusion: This study compared an LLM-powered workflow with human coding for thematic analysis. The LLM aligned strongly with human coders. While error rates necessitate human oversight, the time and cost reductions, together with the ability to identify excerpts human coders missed, make it a potentially reliable supplementary tool. Although ePCTs and implementation science share complementary goals while differing in focus, this flexible approach enhances efficiency and scalability with acceptable accuracy. It streamlines the qualitative research workflow, from outlining through the analysis of implementation processes in real-world settings, and may accelerate existing implementation approaches while minimizing implementation resources.