AI in Qualitative Health Research Appraisal: Comparative Study.

Impact factor: 2.0 · JCR quartile: Q3 (Health Care Sciences & Services)
August Landerholm
{"title":"AI in Qualitative Health Research Appraisal: Comparative Study.","authors":"August Landerholm","doi":"10.2196/72815","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Qualitative research appraisal is crucial for ensuring credible findings but faces challenges due to human variability. Artificial intelligence (AI) models have the potential to enhance the efficiency and consistency of qualitative research assessments.</p><p><strong>Objective: </strong>This study aims to evaluate the performance of 5 AI models (GPT-3.5, Claude 3.5, Sonar Huge, GPT-4, and Claude 3 Opus) in assessing the quality of qualitative research using 3 standardized tools: Critical Appraisal Skills Programme (CASP), Joanna Briggs Institute (JBI) checklist, and Evaluative Tools for Qualitative Studies (ETQS).</p><p><strong>Methods: </strong>AI-generated assessments of 3 peer-reviewed qualitative papers in health and physical activity-related research were analyzed. The study examined systematic affirmation bias, interrater reliability, and tool-dependent disagreements across the AI models. Sensitivity analysis was conducted to evaluate the impact of excluding specific models on agreement levels.</p><p><strong>Results: </strong>Results revealed a systematic affirmation bias across all AI models, with \"Yes\" rates ranging from 75.9% (145/191; Claude 3 Opus) to 85.4% (164/192; Claude 3.5). GPT-4 diverged significantly, showing lower agreement (\"Yes\": 115/192, 59.9%) and higher uncertainty (\"Cannot tell\": 69/192, 35.9%). Proprietary models (GPT-3.5 and Claude 3.5) demonstrated near-perfect alignment (Cramer V=0.891; P<.001), while open-source models showed greater variability. Interrater reliability varied by assessment tool, with CASP achieving the highest baseline consensus (Krippendorff α=0.653), followed by JBI (α=0.477), and ETQS scoring lowest (α=0.376). Sensitivity analysis revealed that excluding GPT-4 increased CASP agreement by 20% (α=0.784), while removing Sonar Huge improved JBI agreement by 18% (α=0.561). ETQS showed marginal improvements when excluding GPT-4 or Claude 3 Opus (+9%, α=0.409). Tool-dependent disagreements were evident, particularly in ETQS criteria, highlighting AI's current limitations in contextual interpretation.</p><p><strong>Conclusions: </strong>The findings demonstrate that AI models exhibit both promise and limitations as evaluators of qualitative research quality. While they enhance efficiency, AI models struggle with reaching consensus in areas requiring nuanced interpretation, particularly for contextual criteria. The study underscores the importance of hybrid frameworks that integrate AI scalability with human oversight, especially for contextual judgment. Future research should prioritize developing AI training protocols that emphasize qualitative epistemology, benchmarking AI performance against expert panels to validate accuracy thresholds, and establishing ethical guidelines for disclosing AI's role in systematic reviews. 
As qualitative methodologies evolve alongside AI capabilities, the path forward lies in collaborative human-AI workflows that leverage AI's efficiency while preserving human expertise for interpretive tasks.</p>","PeriodicalId":14841,"journal":{"name":"JMIR Formative Research","volume":"9 ","pages":"e72815"},"PeriodicalIF":2.0000,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12263093/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Formative Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/72815","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Qualitative research appraisal is crucial for ensuring credible findings, but it is challenged by variability among human reviewers. Artificial intelligence (AI) models have the potential to improve both the efficiency and the consistency of qualitative research assessments.

Objective: This study aims to evaluate the performance of 5 AI models (GPT-3.5, Claude 3.5, Sonar Huge, GPT-4, and Claude 3 Opus) in assessing the quality of qualitative research using 3 standardized tools: Critical Appraisal Skills Programme (CASP), Joanna Briggs Institute (JBI) checklist, and Evaluative Tools for Qualitative Studies (ETQS).

Methods: AI-generated assessments of 3 peer-reviewed qualitative papers in health and physical activity-related research were analyzed. The study examined systematic affirmation bias, interrater reliability, and tool-dependent disagreements across the AI models. Sensitivity analysis was conducted to evaluate the impact of excluding specific models on agreement levels.
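
The abstract reports Krippendorff α per appraisal tool together with a leave-one-model-out sensitivity analysis. A minimal sketch of how such figures could be computed is shown below, using the open-source krippendorff Python package (pip install krippendorff); this is not the study's actual analysis code, and the ratings matrix is a hypothetical placeholder, not data from the paper.

    import numpy as np
    import krippendorff

    MODELS = ["GPT-3.5", "Claude 3.5", "Sonar Huge", "GPT-4", "Claude 3 Opus"]
    CODES = {"Yes": 0, "No": 1, "Cannot tell": 2}  # nominal category codes

    # Rows = models (raters), columns = checklist items (units); values are
    # illustrative placeholders encoded with CODES, not the study's ratings.
    ratings = np.array([
        [0, 0, 0, 1, 0, 0, 2, 0],  # GPT-3.5
        [0, 0, 0, 0, 0, 0, 2, 0],  # Claude 3.5
        [0, 1, 0, 0, 0, 2, 2, 0],  # Sonar Huge
        [2, 2, 0, 1, 2, 2, 2, 0],  # GPT-4
        [0, 0, 1, 0, 0, 0, 2, 0],  # Claude 3 Opus
    ], dtype=float)

    # Baseline agreement across all 5 models (nominal ratings -> nominal alpha)
    baseline = krippendorff.alpha(reliability_data=ratings,
                                  level_of_measurement="nominal")
    print(f"All models: alpha = {baseline:.3f}")

    # Leave-one-model-out sensitivity analysis: recompute alpha without each model
    for i, model in enumerate(MODELS):
        subset = np.delete(ratings, i, axis=0)
        a = krippendorff.alpha(reliability_data=subset,
                               level_of_measurement="nominal")
        print(f"Excluding {model}: alpha = {a:.3f} "
              f"({(a - baseline) / baseline:+.0%} vs baseline)")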

Results: A systematic affirmation bias emerged across all AI models, with "Yes" rates ranging from 75.9% (145/191; Claude 3 Opus) to 85.4% (164/192; Claude 3.5). GPT-4 diverged significantly, showing lower agreement ("Yes": 115/192, 59.9%) and higher uncertainty ("Cannot tell": 69/192, 35.9%). The proprietary models GPT-3.5 and Claude 3.5 demonstrated near-perfect alignment (Cramér V=0.891; P<.001), while the open-source models showed greater variability. Interrater reliability varied by assessment tool: CASP achieved the highest baseline consensus (Krippendorff α=0.653), followed by JBI (α=0.477), with ETQS scoring lowest (α=0.376). Sensitivity analysis showed that excluding GPT-4 increased CASP agreement by 20% (α=0.784), while removing Sonar Huge improved JBI agreement by 18% (α=0.561); ETQS improved only marginally when GPT-4 or Claude 3 Opus was excluded (+9%, α=0.409). Tool-dependent disagreements were evident, particularly for ETQS criteria, highlighting the models' current limitations in contextual interpretation.
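
The pairwise alignment statistic above (Cramér V with a chi-square P value) can in principle be reproduced from a cross-tabulation of two models' per-item verdicts. The sketch below shows one way to do this with SciPy; the 3×3 table of counts is purely illustrative and does not come from the paper.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical cross-tabulation of per-item verdicts:
    # rows = GPT-3.5 ("Yes", "No", "Cannot tell"), columns = Claude 3.5 (same order)
    table = np.array([
        [150,  3,  5],
        [  2, 10,  1],
        [  4,  2, 15],
    ])

    chi2, p, dof, _expected = chi2_contingency(table)
    n = table.sum()
    k = min(table.shape) - 1               # smaller table dimension minus 1
    cramers_v = np.sqrt(chi2 / (n * k))    # Cramér's V from the chi-square statistic
    print(f"Cramér V = {cramers_v:.3f}, P = {p:.3g}")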

Conclusions: The findings demonstrate that AI models exhibit both promise and limitations as evaluators of qualitative research quality. While they enhance efficiency, AI models struggle with reaching consensus in areas requiring nuanced interpretation, particularly for contextual criteria. The study underscores the importance of hybrid frameworks that integrate AI scalability with human oversight, especially for contextual judgment. Future research should prioritize developing AI training protocols that emphasize qualitative epistemology, benchmarking AI performance against expert panels to validate accuracy thresholds, and establishing ethical guidelines for disclosing AI's role in systematic reviews. As qualitative methodologies evolve alongside AI capabilities, the path forward lies in collaborative human-AI workflows that leverage AI's efficiency while preserving human expertise for interpretive tasks.

Source journal
JMIR Formative Research (Medicine, miscellaneous)
CiteScore: 2.70
Self-citation rate: 9.10%
Publication volume: 579
Review time: 12 weeks