{"title":"大型语言模型能判断孩子的陈述吗?: ChatGPT与人类专家在可信度评估中的比较分析。","authors":"Zeki Karataş","doi":"10.1080/26408066.2025.2547211","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study investigates the inter-rater reliability between human experts (a forensic psychologist and a social worker) and a large language model (LLM) in the assessment of child sexual abuse statements. The research aims to explore the potential, limitations, and consistency of this class of AI as an evaluation tool within the framework of Criteria-Based Content Analysis (CBCA), a widely used method for assessing statement credibility.</p><p><strong>Materials and methods: </strong>Sixty-five anonymized transcripts of forensic interviews with child sexual abuse victims (<i>N</i> = 65) were independently evaluated by three raters: a forensic psychologist, a social worker, and a large language model (ChatGPT, GPT-4o Plus). Each statement was coded using the 19-item CBCA framework. Inter-rater reliability was analyzed using Intraclass Correlation Coefficient (ICC), Cohen's Kappa (κ), and other agreement statistics to compare the judgments between the human-human pairing and the human-AI pairings.</p><p><strong>Results: </strong>A high degree of inter-rater reliability was found between the two human experts, with the majority of criteria showing \"good\" to \"excellent\" agreement (15 of 19 criteria with ICC > .75). In stark contrast, a dramatic and significant decrease in reliability was observed when the AI model's evaluations were compared with those of the human experts. The AI demonstrated systematic disagreement on criteria requiring nuanced, contextual judgment, with reliability coefficients frequently falling into \"poor\" or negative ranges (e.g. ICC = -.106 for \"Logical structure\"), indicating its evaluation logic fundamentally differs from expert reasoning.</p><p><strong>Discussion: </strong>The findings reveal a profound gap between the nuanced, contextual reasoning of human experts and the pattern-recognition capabilities of the LLM tested. The study concludes that this type of AI, in its current, prompt-engineered form, cannot reliably replicate expert judgment in the complex task of credibility assessment. While not a viable autonomous evaluator, it may hold potential as a \"cognitive assistant\" to support expert workflows. The assessment of child testimony credibility remains a task that deeply requires professional judgment and appears far beyond the current capabilities of such generative AI models.</p>","PeriodicalId":73742,"journal":{"name":"Journal of evidence-based social work (2019)","volume":" ","pages":"1-16"},"PeriodicalIF":1.4000,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Can a Large Language Model Judge a Child's Statement?: A Comparative Analysis of ChatGPT and Human Experts in Credibility Assessment.\",\"authors\":\"Zeki Karataş\",\"doi\":\"10.1080/26408066.2025.2547211\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>This study investigates the inter-rater reliability between human experts (a forensic psychologist and a social worker) and a large language model (LLM) in the assessment of child sexual abuse statements. 
The research aims to explore the potential, limitations, and consistency of this class of AI as an evaluation tool within the framework of Criteria-Based Content Analysis (CBCA), a widely used method for assessing statement credibility.</p><p><strong>Materials and methods: </strong>Sixty-five anonymized transcripts of forensic interviews with child sexual abuse victims (<i>N</i> = 65) were independently evaluated by three raters: a forensic psychologist, a social worker, and a large language model (ChatGPT, GPT-4o Plus). Each statement was coded using the 19-item CBCA framework. Inter-rater reliability was analyzed using Intraclass Correlation Coefficient (ICC), Cohen's Kappa (κ), and other agreement statistics to compare the judgments between the human-human pairing and the human-AI pairings.</p><p><strong>Results: </strong>A high degree of inter-rater reliability was found between the two human experts, with the majority of criteria showing \\\"good\\\" to \\\"excellent\\\" agreement (15 of 19 criteria with ICC > .75). In stark contrast, a dramatic and significant decrease in reliability was observed when the AI model's evaluations were compared with those of the human experts. The AI demonstrated systematic disagreement on criteria requiring nuanced, contextual judgment, with reliability coefficients frequently falling into \\\"poor\\\" or negative ranges (e.g. ICC = -.106 for \\\"Logical structure\\\"), indicating its evaluation logic fundamentally differs from expert reasoning.</p><p><strong>Discussion: </strong>The findings reveal a profound gap between the nuanced, contextual reasoning of human experts and the pattern-recognition capabilities of the LLM tested. The study concludes that this type of AI, in its current, prompt-engineered form, cannot reliably replicate expert judgment in the complex task of credibility assessment. While not a viable autonomous evaluator, it may hold potential as a \\\"cognitive assistant\\\" to support expert workflows. The assessment of child testimony credibility remains a task that deeply requires professional judgment and appears far beyond the current capabilities of such generative AI models.</p>\",\"PeriodicalId\":73742,\"journal\":{\"name\":\"Journal of evidence-based social work (2019)\",\"volume\":\" \",\"pages\":\"1-16\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2025-08-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of evidence-based social work (2019)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1080/26408066.2025.2547211\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of evidence-based social work (2019)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1080/26408066.2025.2547211","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Can a Large Language Model Judge a Child's Statement?: A Comparative Analysis of ChatGPT and Human Experts in Credibility Assessment.
Purpose: This study investigates the inter-rater reliability between human experts (a forensic psychologist and a social worker) and a large language model (LLM) in the assessment of child sexual abuse statements. The research aims to explore the potential, limitations, and consistency of this class of AI as an evaluation tool within the framework of Criteria-Based Content Analysis (CBCA), a widely used method for assessing statement credibility.
Materials and methods: Sixty-five anonymized transcripts of forensic interviews with child sexual abuse victims (N = 65) were independently evaluated by three raters: a forensic psychologist, a social worker, and a large language model (ChatGPT, GPT-4o Plus). Each statement was coded using the 19-item CBCA framework. Inter-rater reliability was analyzed using Intraclass Correlation Coefficient (ICC), Cohen's Kappa (κ), and other agreement statistics to compare the judgments between the human-human pairing and the human-AI pairings.
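A minimal sketch of how such pairwise agreement statistics can be computed is shown below. This is not the authors' analysis code: the rating vectors are hypothetical, and the use of the pandas, pingouin, and scikit-learn libraries is an assumption made for illustration.

```python
# Illustrative only: hypothetical 0/1 codings of a single CBCA criterion for
# eight statements by three raters (two humans, one LLM). The study data are
# not public, and these values are invented for the example.
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

ratings = pd.DataFrame({
    "statement": list(range(1, 9)) * 3,
    "rater": ["psychologist"] * 8 + ["social_worker"] * 8 + ["llm"] * 8,
    "score": [1, 1, 0, 1, 0, 1, 1, 0,   # forensic psychologist
              1, 1, 0, 1, 0, 1, 1, 0,   # social worker (agrees closely)
              0, 1, 1, 0, 0, 1, 0, 1],  # LLM (diverges systematically)
})

# Cohen's kappa for a single rater pair (here, one human-AI pairing)
human = ratings.loc[ratings.rater == "psychologist", "score"].to_numpy()
llm = ratings.loc[ratings.rater == "llm", "score"].to_numpy()
print("kappa, psychologist vs. LLM:", cohen_kappa_score(human, llm))

# Two-way intraclass correlation across all three raters for this criterion
icc = pg.intraclass_corr(data=ratings, targets="statement",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```

In a design like the one described, these statistics would be computed separately for each of the 19 CBCA criteria and for each rater pairing.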
Results: A high degree of inter-rater reliability was found between the two human experts, with the majority of criteria showing "good" to "excellent" agreement (15 of 19 criteria with ICC > .75). In stark contrast, a dramatic and significant decrease in reliability was observed when the AI model's evaluations were compared with those of the human experts. The AI demonstrated systematic disagreement on criteria requiring nuanced, contextual judgment, with reliability coefficients frequently falling into "poor" or negative ranges (e.g. ICC = -.106 for "Logical structure"), indicating its evaluation logic fundamentally differs from expert reasoning.
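The qualitative labels ("poor," "good," "excellent") correspond to the cut-offs conventionally used for ICC values (often attributed to Koo and Li, 2016); whether the paper applies exactly these thresholds is an assumption inferred from the "ICC > .75" wording. A small illustrative helper:

```python
# Illustrative mapping from an ICC value to conventional qualitative bands
# (< .50 poor, .50-.75 moderate, .75-.90 good, >= .90 excellent). The paper's
# exact cut-offs are not stated in the abstract; these are assumed.
def icc_band(icc: float) -> str:
    if icc < 0.50:
        return "poor"        # negative values such as -.106 also land here
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"

print(icc_band(-0.106))  # "poor" -- the reported value for "Logical structure"
print(icc_band(0.80))    # "good"
```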
Discussion: The findings reveal a profound gap between the nuanced, contextual reasoning of human experts and the pattern-recognition capabilities of the LLM tested. The study concludes that this type of AI, in its current, prompt-engineered form, cannot reliably replicate expert judgment in the complex task of credibility assessment. Although it is not a viable autonomous evaluator, it may hold potential as a "cognitive assistant" supporting expert workflows. The assessment of child testimony credibility remains a task that fundamentally requires professional judgment and appears to lie well beyond the current capabilities of such generative AI models.