Large language models for deductive qualitative content analysis in dementia-focused embedded pragmatic clinical trials: A comparative methodological study.

Implementation Science Communications (IF 3.3)
Jeffrey Turner, Spencer Phillips Hey, Zachary G Baker, Vincent Mor, Jennifer L Sullivan
DOI: 10.1186/s43058-026-00953-8 · Published: 2026-05-06 · Citations: 0

Abstract

Introduction: Thematic coding helps researchers characterize intervention implementation in embedded pragmatic clinical trials (ePCTs), particularly for interventions serving older adults with dementia and their care partners. However, manual coding is time-consuming and requires multiple researchers. Because implementation science relies on systematic identification of determinants, barriers, and facilitators, advances in artificial intelligence (AI), specifically large language models (LLMs), may meaningfully automate this process and accelerate implementation evaluations within ePCTs. We developed and tested an automated workflow using GPT-4o and GPT-4o-mini, aiming for human-level performance in coding interview transcripts.

Methods: We created a Python-based system that uses LLMs to process and code semi-structured interview transcripts about implementation challenges in translating dementia interventions into healthcare systems. The system matches excerpts to an existing codebook. Multiple iterations, including expert review, were used to refine accuracy and efficiency.
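The abstract does not include the system itself. As a hedged illustration only, the deductive-coding step described (an LLM matching transcript excerpts to an existing codebook via GPT-4o) might be sketched as below; the codebook entries, prompt wording, and `assign_codes` helper are hypothetical assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the deductive-coding step described in Methods:
# an LLM is asked to match an interview excerpt to codes from a fixed codebook.
# Codebook entries, prompt wording, and parsing are illustrative assumptions,
# NOT the authors' actual implementation.
try:
    from openai import OpenAI  # pip install openai; only needed for the API call
except ImportError:
    OpenAI = None

CODEBOOK = {
    "Site Characteristics": "Features of the healthcare site affecting implementation",
    "Staff Buy-In": "Staff attitudes toward adopting the intervention",
}

def build_prompt(excerpt: str, codebook: dict[str, str]) -> str:
    """Compose a deductive-coding prompt listing the allowed codes."""
    code_lines = "\n".join(f"- {name}: {desc}" for name, desc in codebook.items())
    return (
        "Assign zero or more codes from this codebook to the excerpt.\n"
        f"Codebook:\n{code_lines}\n"
        f'Excerpt: "{excerpt}"\n'
        "Reply with a comma-separated list of code names, or NONE."
    )

def parse_codes(reply: str, codebook: dict[str, str]) -> list[str]:
    """Keep only replies that name codes actually present in the codebook."""
    return [c.strip() for c in reply.split(",") if c.strip() in codebook]

def assign_codes(excerpt: str, model: str = "gpt-4o") -> list[str]:
    """One API round-trip per excerpt (requires an OPENAI_API_KEY)."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(excerpt, CODEBOOK)}],
    )
    return parse_codes(resp.choices[0].message.content, CODEBOOK)
```

`build_prompt` and `parse_codes` run offline; constraining the reply to codebook names and discarding anything else is one simple way to keep an LLM's output inside a deductive (closed) coding scheme.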

Results: The LLM consistently coded more excerpts than the human coders. In the third iteration (V3), the LLM captured 61.7% of human-coded excerpts, with matching rates reaching as high as 72.6% for individual transcripts. Matching was higher for descriptive codes (63.7%) than for interpretive codes (57.7%). The LLM also identified 206 correctly coded excerpts that human coders had missed. In the fourth iteration (V4), GPT-4o outperformed GPT-4o-mini: descriptive code matching reached 89% (e.g., "Site Characteristics"), compared with 69% for GPT-4o-mini at the R1+R2 85% threshold. GPT-4o showed a weak positive correlation (r = 0.230) between transcript word count and matching agreement, while GPT-4o-mini showed a moderate negative correlation (r = -0.452). The LLM workflow yielded a 97% reduction in time and a 99% reduction in cost per transcript.

Conclusion: This study compared an LLM-powered workflow with human coding for thematic analysis. The LLM aligned strongly with human coders. While its error rates necessitate human oversight, the reductions in time and cost, together with its ability to identify excerpts that humans missed, make it a potentially reliable supplementary tool. Although ePCTs and implementation science share complementary goals while differing in focus, this flexible approach enhances efficiency and scalability with acceptable accuracy. It streamlines the qualitative research workflow from outlining through the analysis of implementation processes in real-world settings and may accelerate existing implementation approaches while minimizing implementation resources.
