{"title":"Evaluation of error detection and treatment recommendations in nucleic acid test reports using ChatGPT models.","authors":"Wenzheng Han, Chao Wan, Rui Shan, Xudong Xu, Guang Chen, Wenjie Zhou, Yuxuan Yang, Gang Feng, Xiaoning Li, Jianghua Yang, Kai Jin, Qing Chen","doi":"10.1515/cclm-2025-0089","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>Accurate medical laboratory reports are essential for delivering high-quality healthcare. Recently, advanced artificial intelligence models, such as those in the ChatGPT series, have shown considerable promise in this domain. This study assessed the performance of specific GPT models-namely, 4o, o1, and o1 mini-in identifying errors within medical laboratory reports and in providing treatment recommendations.</p><p><strong>Methods: </strong>In this retrospective study, 86 medical laboratory reports of Nucleic acid test report for the seven upper respiratory tract pathogens were compiled. There were 285 errors from four common error categories intentionally and randomly introduced into reports and generated 86 incorrected reports. GPT models were tasked with detecting these errors, using three senior medical laboratory scientists (SMLS) and three medical laboratory interns (MLI) as control groups. Additionally, GPT models were tasked with generating accurate and reliable treatment recommendations following positive test outcomes based on 86 corrected reports. χ2 tests, Kruskal-Wallis tests, and Wilcoxon tests were used for statistical analysis where appropriate.</p><p><strong>Results: </strong>In comparison with SMLS or MLI, GPT models accurately detected three error types, and the average detection rates of the three GPT models were 88.9 %(omission), 91.6 % (time sequence), and 91.7 % (the same individual acted both as the inspector and the reviewer). However, the average detection rate for errors in the result input format by the three GPT models was only 51.9 %, indicating a relatively poor performance in this aspect. GPT models exhibited substantial to almost perfect agreement with SMLS in detecting total errors (kappa [min, max]: 0.778, 0.837). However, the agreement between GPT models and MLI was moderately lower (kappa [min, max]: 0.632, 0.696). When it comes to reading all 86 reports, GPT models showed obviously reduced reading time compared with SMLS or MLI (all p<0.001). Notably, our study also found the GPT-o1 mini model had better consistency of error identification than the GPT-o1 model, which was better than that of the GPT-4o model. The pairwise comparisons of the same GPT model's outputs across three repeated runs showed almost perfect agreement (kappa [min, max]: 0.912, 0.996). GPT-o1 mini showed obviously reduced reading time compared with GPT-4o or GPT-o1(all p<0.001). 
Additionally, GPT-o1 significantly outperformed GPT-4o or o1 mini in providing accurate and reliable treatment recommendations (all p<0.0001).</p><p><strong>Conclusions: </strong>The detection capability of some of medical laboratory report errors and the accuracy and reliability of treatment recommendations of GPT models was competent, especially, potentially reducing work hours and enhancing clinical decision-making.</p>","PeriodicalId":10390,"journal":{"name":"Clinical chemistry and laboratory medicine","volume":" ","pages":""},"PeriodicalIF":3.8000,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical chemistry and laboratory medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1515/cclm-2025-0089","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MEDICAL LABORATORY TECHNOLOGY","Score":null,"Total":0}
Abstract
Objectives: Accurate medical laboratory reports are essential for delivering high-quality healthcare. Recently, advanced artificial intelligence models, such as those in the ChatGPT series, have shown considerable promise in this domain. This study assessed the performance of specific GPT models (GPT-4o, o1, and o1 mini) in identifying errors within medical laboratory reports and in providing treatment recommendations.
Methods: In this retrospective study, 86 nucleic acid test reports covering seven upper respiratory tract pathogens were compiled. A total of 285 errors from four common error categories were intentionally and randomly introduced into these reports, generating 86 error-containing reports. GPT models were tasked with detecting these errors, with three senior medical laboratory scientists (SMLS) and three medical laboratory interns (MLI) serving as control groups. Additionally, GPT models were tasked with generating accurate and reliable treatment recommendations for positive test results based on the 86 corrected reports. χ2 tests, Kruskal-Wallis tests, and Wilcoxon tests were used for statistical analysis where appropriate.
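The abstract names the statistical tests but does not include the analysis code or raw data. The following is a minimal illustrative sketch, using hypothetical numbers (not the study's data), of how such tests are typically run in Python with scipy.

```python
# Illustrative sketch only: all values below are hypothetical placeholders.
from scipy.stats import chi2_contingency, kruskal, wilcoxon

# χ2 test on a hypothetical 2x2 table of detected vs. missed errors (GPT model vs. SMLS)
detected_missed = [[254, 31],   # GPT model: detected, missed
                   [270, 15]]   # SMLS:      detected, missed
chi2, p_chi2, dof, expected = chi2_contingency(detected_missed)

# Kruskal-Wallis test on hypothetical per-report reading times (seconds) for three groups
gpt_times = [12, 15, 11, 14, 13, 16]
smls_times = [95, 102, 88, 110, 97, 105]
mli_times = [120, 131, 118, 125, 129, 122]
h_stat, p_kw = kruskal(gpt_times, smls_times, mli_times)

# Wilcoxon signed-rank test on hypothetical paired reading times for two models
gpt_4o_times = [20, 22, 19, 25, 21, 23, 24, 20]
gpt_o1_mini_times = [12, 15, 11, 14, 13, 16, 15, 12]
w_stat, p_wx = wilcoxon(gpt_4o_times, gpt_o1_mini_times)

print(f"chi2 p={p_chi2:.4f}, Kruskal-Wallis p={p_kw:.4f}, Wilcoxon p={p_wx:.4f}")
```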
Results: In comparison with SMLS or MLI, GPT models accurately detected three error types, with average detection rates across the three GPT models of 88.9 % (omission), 91.6 % (time sequence), and 91.7 % (the same individual acting as both inspector and reviewer). However, the average detection rate for errors in the result input format was only 51.9 %, indicating relatively poor performance in this respect. GPT models showed substantial to almost perfect agreement with SMLS in detecting total errors (kappa [min, max]: 0.778, 0.837), whereas agreement between GPT models and MLI was moderately lower (kappa [min, max]: 0.632, 0.696). In reading all 86 reports, GPT models required markedly less time than SMLS or MLI (all p<0.001). Notably, the GPT-o1 mini model showed better consistency in error identification than GPT-o1, which in turn was better than GPT-4o. Pairwise comparisons of the same GPT model's outputs across three repeated runs showed almost perfect agreement (kappa [min, max]: 0.912, 0.996). GPT-o1 mini also required markedly less reading time than GPT-4o or GPT-o1 (all p<0.001). Additionally, GPT-o1 significantly outperformed GPT-4o and o1 mini in providing accurate and reliable treatment recommendations (all p<0.0001).
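The agreement figures above are Cohen's kappa values. As a minimal illustrative sketch, assuming each introduced error is labeled 1 (flagged) or 0 (missed) by each rater, such agreement could be computed as follows; the labels below are hypothetical, not the study's data.

```python
# Illustrative sketch only: hypothetical per-error detection labels (1 = flagged, 0 = missed).
from sklearn.metrics import cohen_kappa_score

gpt_flags  = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical GPT model output per introduced error
smls_flags = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1]  # hypothetical senior scientist (SMLS) labels
mli_flags  = [1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1]  # hypothetical intern (MLI) labels

print("GPT vs SMLS kappa:", cohen_kappa_score(gpt_flags, smls_flags))
print("GPT vs MLI  kappa:", cohen_kappa_score(gpt_flags, mli_flags))
```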
Conclusions: GPT models were competent at detecting certain medical laboratory report errors and at providing accurate and reliable treatment recommendations, potentially reducing work hours and enhancing clinical decision-making.
Journal introduction:
Clinical Chemistry and Laboratory Medicine (CCLM) publishes articles on novel teaching and training methods applicable to laboratory medicine. CCLM welcomes contributions on progress in fundamental and applied research and on cutting-edge clinical laboratory medicine. It is one of the leading journals in the field, with an impact factor above 3. CCLM is issued monthly and is published both in print and electronically.
CCLM is the official journal of the European Federation of Clinical Chemistry and Laboratory Medicine (EFLM) and regularly publishes EFLM recommendations and news. CCLM is also the official journal of the national societies of Austria (ÖGLMKC), Belgium (RBSLM), Germany (DGKL), Hungary (MLDT), Ireland (ACBI), Italy (SIBioC), Portugal (SPML), and Slovenia (SZKK), and it is affiliated with the AACB (Australia) and SFBC (France).
Topics:
- clinical biochemistry
- clinical genomics and molecular biology
- clinical haematology and coagulation
- clinical immunology and autoimmunity
- clinical microbiology
- drug monitoring and analysis
- evaluation of diagnostic biomarkers
- disease-oriented topics (cardiovascular disease, cancer diagnostics, diabetes)
- new reagents, instrumentation and technologies
- new methodologies
- reference materials and methods
- reference values and decision limits
- quality and safety in laboratory medicine
- translational laboratory medicine
- clinical metrology