Comparing the Efficacy and Efficiency of Human and Generative AI: Qualitative Thematic Analyses.

JMIR AI · Pub Date: 2024-08-02 · DOI: 10.2196/54482
Maximo R Prescott, Samantha Yeager, Lillian Ham, Carlos D Rivera Saldana, Vanessa Serrano, Joey Narez, Dafna Paltin, Jorge Delgado, David J Moore, Jessica Montoya

Abstract

Background: Qualitative methods are highly valuable for the dissemination and implementation of new digital health interventions; however, these methods can be time intensive and can slow dissemination when timely knowledge from the data sources is needed in ever-changing health systems. Recent advancements in generative artificial intelligence (GenAI) and the underlying large language models (LLMs) may offer a promising opportunity to expedite the qualitative analysis of textual data, but their efficacy and reliability remain unknown.

Objective: The primary objectives of our study were to evaluate the consistency in themes, reliability of coding, and time needed for inductive and deductive thematic analyses between GenAI (ie, ChatGPT and Bard) and human coders.

Methods: The qualitative data for this study consisted of 40 brief SMS text message reminder prompts used in a digital health intervention for promoting antiretroviral medication adherence among people with HIV who use methamphetamine. Inductive and deductive thematic analyses of these SMS text messages were conducted by 2 independent teams of human coders. An independent human analyst conducted analyses following both approaches using ChatGPT and Bard. The consistency in themes (or the extent to which the themes were the same) and reliability (or agreement in coding of themes) between methods were compared.

Results: The themes generated by GenAI (both ChatGPT and Bard) were consistent with 71% (5/7) of the themes identified by human analysts following inductive thematic analysis. The consistency in themes was lower between humans and GenAI following a deductive thematic analysis procedure (ChatGPT: 6/12, 50%; Bard: 7/12, 58%). The percentage agreement (or intercoder reliability) for these congruent themes between human coders and GenAI ranged from fair to moderate (ChatGPT, inductive: 31/66, 47%; ChatGPT, deductive: 22/59, 37%; Bard, inductive: 20/54, 37%; Bard, deductive: 21/58, 36%). In general, ChatGPT and Bard performed similarly across both types of qualitative analyses in terms of consistency of themes (inductive: 6/6, 100%; deductive: 5/6, 83%) and reliability of coding (inductive: 23/62, 37%; deductive: 22/47, 47%). On average, GenAI required significantly less overall time than human coders when conducting qualitative analysis (mean 20, SD 3.5 min vs mean 567, SD 106.5 min).
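The percentage agreement figures above are reported as counts over a denominator (e.g., 31/66). The abstract does not specify exactly how these denominators were derived, but a minimal sketch of one plausible variant — positive agreement over the union of codes that either coder applied, ignoring joint absences — could look like the following (the function name and data layout are illustrative assumptions, not the authors' implementation):

```python
def positive_percent_agreement(human, genai):
    """Percent agreement restricted to codes that at least one coder applied.

    human, genai: dicts mapping a message ID to the set of theme labels
    applied to that message. Joint absences (themes neither coder applied)
    are excluded, so this behaves like a pooled Jaccard index rather than
    raw percent agreement over all message-theme pairs.
    """
    agree = 0   # codes both coders applied to the same message
    total = 0   # codes applied by at least one coder
    for mid in human:
        a = human[mid]
        b = genai.get(mid, set())
        agree += len(a & b)
        total += len(a | b)
    return agree / total if total else float("nan")
```

For example, if a human coder tags message 1 with {adherence, support} and GenAI tags it with only {adherence}, that message contributes 1 agreement out of 2 applied codes. Measures such as Cohen's kappa would additionally correct for chance agreement, which raw percent agreement does not.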

Conclusions: The consistency between the themes generated by human coders and GenAI suggests that these technologies hold promise for reducing the resource intensiveness of qualitative thematic analysis; however, the relatively low intercoder reliability between them suggests that hybrid approaches are necessary. Human coders appeared to be better than GenAI at identifying nuanced and interpretative themes. Future studies should consider how these powerful technologies can best be used in collaboration with human coders in hybrid approaches to improve the efficiency of qualitative research while also mitigating the potential ethical risks they may pose.
