Santiago Nogué-Xarau, José Ríos-Guillermo, Montserrat Amigó-Tadín
{"title":"比较人工智能系统和临床毒理学家对中毒问题的回答:它们的答案可以区分吗?","authors":"Santiago Nogué-Xarau, José Ríos-Guillermo, Montserrat Amigó-Tadín","doi":"10.55633/s3me/082.2024","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>To present questions about poisoning to 4 artificial intelligence (AI) systems and 4 clinical toxicologists and determine whether readers can identify the source of the answers. To evaluate and compare text quality and level of knowledge found in the AI and toxicologists' responses.</p><p><strong>Methods: </strong>Ten questions about toxicology were presented to the following AI systems: Copilot, Bard, Luzia, and ChatGPT. Four clinical toxicologists were asked to answer the same questions. Twenty-four recruited experts in toxicology were sent a pair of answers (1 from an AI system and one from a toxicologist) for each of the 10 questions. For each answer, the experts had to identify the source, evaluate text quality, and assess level of knowledge reflected. Quantitative variables were described as mean (SD) and qualitative ones as absolute frequency and proportion. A value of P .05 was considered significant in all comparisons.</p><p><strong>Results: </strong>Of the 240 evaluated AI answers, the expert evaluators thought that 21 (8.8%) and 38 (15.8%), respectively, were certainly or probably written by a toxicologist. The experts were unable to guess the source of 13 (5.4%) AI answers. Luzia and ChatGPT were better able to mislead the experts than Bard (P = .036 and P = .041, respectively). Text quality was judged excellent in 38.8% of the AI answers. ChatGPT text quality was rated highest (61.3% excellent) vs Bard (34.4%), Luzia (31.7%), and Copilot (26.3%) (P .001, all comparisons). The average score for the level of knowledge perceived in the AI answers was 7.23 (1.57) out of 10. The highest average score was achieved by ChatGPT at 8.03 (1.26) vs Luzia (7.02 [1,63]), Bard (6.91 [1.64]), and Copilot (6.91 [1.46]) (P .001, all comparisons).</p><p><strong>Conclusions: </strong>Luzia and ChatGPT answers to the toxicology questions were often thought to resemble those of clinical toxicologists. ChatGPT answers were judged to be very well-written and reflect a very high level of knowledge.</p>","PeriodicalId":93987,"journal":{"name":"Emergencias : revista de la Sociedad Espanola de Medicina de Emergencias","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Comparing answers of artificial intelligence systems and clinical toxicologists to questions about poisoning: Can their answers be distinguished?\",\"authors\":\"Santiago Nogué-Xarau, José Ríos-Guillermo, Montserrat Amigó-Tadín\",\"doi\":\"10.55633/s3me/082.2024\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>To present questions about poisoning to 4 artificial intelligence (AI) systems and 4 clinical toxicologists and determine whether readers can identify the source of the answers. To evaluate and compare text quality and level of knowledge found in the AI and toxicologists' responses.</p><p><strong>Methods: </strong>Ten questions about toxicology were presented to the following AI systems: Copilot, Bard, Luzia, and ChatGPT. Four clinical toxicologists were asked to answer the same questions. 
Twenty-four recruited experts in toxicology were sent a pair of answers (1 from an AI system and one from a toxicologist) for each of the 10 questions. For each answer, the experts had to identify the source, evaluate text quality, and assess level of knowledge reflected. Quantitative variables were described as mean (SD) and qualitative ones as absolute frequency and proportion. A value of P .05 was considered significant in all comparisons.</p><p><strong>Results: </strong>Of the 240 evaluated AI answers, the expert evaluators thought that 21 (8.8%) and 38 (15.8%), respectively, were certainly or probably written by a toxicologist. The experts were unable to guess the source of 13 (5.4%) AI answers. Luzia and ChatGPT were better able to mislead the experts than Bard (P = .036 and P = .041, respectively). Text quality was judged excellent in 38.8% of the AI answers. ChatGPT text quality was rated highest (61.3% excellent) vs Bard (34.4%), Luzia (31.7%), and Copilot (26.3%) (P .001, all comparisons). The average score for the level of knowledge perceived in the AI answers was 7.23 (1.57) out of 10. The highest average score was achieved by ChatGPT at 8.03 (1.26) vs Luzia (7.02 [1,63]), Bard (6.91 [1.64]), and Copilot (6.91 [1.46]) (P .001, all comparisons).</p><p><strong>Conclusions: </strong>Luzia and ChatGPT answers to the toxicology questions were often thought to resemble those of clinical toxicologists. ChatGPT answers were judged to be very well-written and reflect a very high level of knowledge.</p>\",\"PeriodicalId\":93987,\"journal\":{\"name\":\"Emergencias : revista de la Sociedad Espanola de Medicina de Emergencias\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Emergencias : revista de la Sociedad Espanola de Medicina de Emergencias\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.55633/s3me/082.2024\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Emergencias : revista de la Sociedad Espanola de Medicina de Emergencias","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.55633/s3me/082.2024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Comparing answers of artificial intelligence systems and clinical toxicologists to questions about poisoning: Can their answers be distinguished?
Objective: To present questions about poisoning to 4 artificial intelligence (AI) systems and 4 clinical toxicologists and determine whether readers can identify the source of the answers. To evaluate and compare text quality and level of knowledge found in the AI and toxicologists' responses.
Methods: Ten questions about toxicology were presented to the following AI systems: Copilot, Bard, Luzia, and ChatGPT. Four clinical toxicologists were asked to answer the same questions. Twenty-four recruited experts in toxicology were sent a pair of answers (one from an AI system and one from a toxicologist) for each of the 10 questions. For each answer, the experts had to identify the source, evaluate text quality, and assess the level of knowledge reflected. Quantitative variables were described as mean (SD) and qualitative ones as absolute frequency and proportion. A value of P < .05 was considered significant in all comparisons.
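The abstract does not name the statistical tests behind these comparisons, so the sketch below is only an illustration of the kind of analysis the Methods describe: a chi-square test for comparing rating proportions and Student's t-test for comparing mean scores are assumptions, and all counts, scores, and system labels are toy values rather than study data.

```python
# Minimal sketch of the descriptive and comparative analysis outlined in the
# Methods. Assumptions: chi-square test for proportions, Student's t-test for
# means; all numbers and labels below are illustrative, not the study data.
import numpy as np
from scipy import stats

ALPHA = 0.05  # significance threshold stated in the Methods (P < .05)

# Qualitative variable: was an AI answer's text quality rated "excellent"?
# Rows = two hypothetical AI systems, columns = [excellent, not excellent].
quality_table = np.array([
    [37, 23],
    [16, 44],
])
chi2, p_quality, dof, expected = stats.chi2_contingency(quality_table)
for name, (exc, not_exc) in zip(["system A", "system B"], quality_table):
    total = exc + not_exc
    print(f"{name}: {exc}/{total} excellent ({100 * exc / total:.1f}%)")
print(f"chi-square P = {p_quality:.3f}; significant: {p_quality < ALPHA}")

# Quantitative variable: perceived level of knowledge on a 0-10 scale,
# described as mean (SD) and compared between two systems with a t-test.
rng = np.random.default_rng(42)
scores_a = rng.normal(8.0, 1.3, size=60)  # toy scores, not study data
scores_b = rng.normal(6.9, 1.5, size=60)
t_stat, p_knowledge = stats.ttest_ind(scores_a, scores_b)
print(f"mean (SD): {scores_a.mean():.2f} ({scores_a.std(ddof=1):.2f}) vs "
      f"{scores_b.mean():.2f} ({scores_b.std(ddof=1):.2f})")
print(f"t-test P = {p_knowledge:.3f}; significant: {p_knowledge < ALPHA}")
```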
Results: Of the 240 evaluated AI answers, the expert evaluators judged 21 (8.8%) to have certainly, and 38 (15.8%) to have probably, been written by a toxicologist. The experts were unable to guess the source of 13 (5.4%) of the AI answers. Luzia and ChatGPT were better able to mislead the experts than Bard (P = .036 and P = .041, respectively). Text quality was judged excellent in 38.8% of the AI answers. ChatGPT's text quality was rated highest (61.3% excellent) vs Bard (34.4%), Luzia (31.7%), and Copilot (26.3%) (P < .001, all comparisons). The average score for the level of knowledge perceived in the AI answers was 7.23 (1.57) out of 10. The highest average score was achieved by ChatGPT at 8.03 (1.26) vs Luzia (7.02 [1.63]), Bard (6.91 [1.64]), and Copilot (6.91 [1.46]) (P < .001, all comparisons).
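As a quick arithmetic check, the percentages reported above follow directly from dividing each count by the 240 evaluated AI answers; the short sketch below only reproduces that division (labels are illustrative):

```python
# Recompute the reported proportions from the counts in the Results.
total_ai_answers = 240
counts = {
    "certainly written by a toxicologist": 21,
    "probably written by a toxicologist": 38,
    "source could not be guessed": 13,
}
for label, n in counts.items():
    print(f"{label}: {n}/{total_ai_answers} = {100 * n / total_ai_answers:.1f}%")
# Prints 8.8%, 15.8%, and 5.4%, matching the figures reported above.
```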
Conclusions: Luzia and ChatGPT answers to the toxicology questions were often thought to resemble those of clinical toxicologists. ChatGPT answers were judged to be very well written and to reflect a very high level of knowledge.