Interobserver agreement between artificial intelligence models in the thyroid imaging and reporting data system (TIRADS) assessment of thyroid nodules.

IF 3.0 · Zone 3 (Medicine) · JCR Q2 (Endocrinology & Metabolism)

Endocrine Pub Date : 2025-07-01 Epub Date: 2025-05-15 DOI:10.1007/s12020-025-04272-1

Andrea Leoncini, Pierpaolo Trimboli

Journal: Endocrine, pages 197-201. Publication type: Journal Article.

Citations: 0

Abstract

Background: As ultrasound (US) is the most accurate tool for assessing the risk of malignancy (RoM) of thyroid nodules (TNs), international societies have published various Thyroid Imaging and Reporting Data Systems (TIRADSs). With the recent advent of artificial intelligence (AI), clinicians and researchers should ask how AI models interpret TIRADS terminology and whether different AI models agree in their risk assessment of TNs. This study aimed to analyze the interobserver agreement (IOA) between AI models in assessing the RoM of TNs across the categories of various TIRADSs, using a case series created by combining TIRADS descriptors.

Methods: ChatGPT, Google Gemini, and Claude were compared. ACR-TIRADS, EU-TIRADS, and K-TIRADS were used to evaluate the AI assessments. Multiple written scenarios were created for each of the three TIRADSs, the cases were evaluated by the three AI models, and their assessments were analyzed and compared. The IOA was estimated by comparing kappa (κ) values.
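The abstract does not say which kappa variant was used (unweighted Cohen's, weighted, or Fleiss'). As an illustration only, the sketch below assumes unweighted Cohen's kappa for one pair of raters, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name, the model labels, and the six example ratings are hypothetical and are not taken from the study.

```python
from collections import Counter
import math


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters labelling the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(ratings_a)
    # Observed agreement: fraction of cases given identical labels.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if math.isclose(p_e, 1.0):
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (p_o - p_e) / (1 - p_e)


# Hypothetical EU-TIRADS categories (2-5) assigned to six made-up scenarios.
model_x = [2, 3, 4, 5, 5, 3]
model_y = [2, 3, 4, 4, 5, 3]
kappa_xy = cohen_kappa(model_x, model_y)
print(f"kappa = {kappa_xy:.2f}")
```

Because TIRADS categories are ordinal, a weighted kappa that penalizes a one-category disagreement less than a three-category one would be a reasonable alternative; whether the authors used one cannot be determined from the abstract.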

Results: Ninety scenarios were created. With ACR-TIRADS, the IOA analysis gave κ = 0.58 between ChatGPT and Gemini, 0.53 between ChatGPT and Claude, and 0.90 between Gemini and Claude. With EU-TIRADS, κ was 0.73 between ChatGPT and Gemini, 0.62 between ChatGPT and Claude, and 0.72 between Gemini and Claude. With K-TIRADS, κ was 0.88 between ChatGPT and Gemini, 0.70 between ChatGPT and Claude, and 0.61 between Gemini and Claude.
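To make the nine reported κ values easier to compare, the sketch below collects them in one structure and attaches a verbal label to each. The labels follow the commonly cited Landis and Koch bands (0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect); applying that scale here is an assumption, since the abstract does not state which interpretation scale the authors used, if any. The variable and function names are illustrative.

```python
# Kappa values exactly as reported in the Results above.
REPORTED_KAPPA = {
    "ACR-TIRADS": {("ChatGPT", "Gemini"): 0.58, ("ChatGPT", "Claude"): 0.53, ("Gemini", "Claude"): 0.90},
    "EU-TIRADS": {("ChatGPT", "Gemini"): 0.73, ("ChatGPT", "Claude"): 0.62, ("Gemini", "Claude"): 0.72},
    "K-TIRADS": {("ChatGPT", "Gemini"): 0.88, ("ChatGPT", "Claude"): 0.70, ("Gemini", "Claude"): 0.61},
}

# Upper bounds of the Landis and Koch bands (an assumed interpretation scale).
BANDS = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
         (0.80, "substantial"), (1.00, "almost perfect")]


def landis_koch(kappa):
    """Map a kappa value to its Landis and Koch verbal label."""
    if kappa < 0:
        return "poor"
    for upper, label in BANDS:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Flatten to (system, pair, kappa) rows so extremes are easy to find.
rows = [(system, pair, k)
        for system, pairs in REPORTED_KAPPA.items()
        for pair, k in pairs.items()]
lowest = min(rows, key=lambda row: row[2])
highest = max(rows, key=lambda row: row[2])

for system, pair, k in rows:
    print(f"{system:11s} {pair[0]} vs {pair[1]}: kappa = {k:.2f} ({landis_koch(k)})")
```

Read this way, both the lowest and the highest agreement occur under ACR-TIRADS (0.53 for ChatGPT vs Claude and 0.90 for Gemini vs Claude), while all three EU-TIRADS pairs fall in the same band.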

Conclusion: This study found non-negligible variability between the three AI models. Clinicians and patients should be aware of these findings.

Source journal

Endocrine (Endocrinology & Metabolism)

CiteScore: 6.50
Self-citation rate: 5.40%
Articles published: 295
Review time: 1.5 months

About the journal: Well-established as a major journal in today's rapidly advancing experimental and clinical research areas, Endocrine publishes original articles devoted to basic (including molecular, cellular and physiological studies), translational and clinical research in all the different fields of endocrinology and metabolism. Articles will be accepted based on peer review, priority, and editorial decision. Invited reviews, mini-reviews and viewpoints on relevant pathophysiological and clinical topics, as well as Editorials on articles appearing in the Journal, are published. Unsolicited Editorials will be evaluated by the editorial team. Outcomes of scientific meetings, as well as guidelines and position statements, may be submitted. The Journal also considers special feature articles in the field of endocrine genetics and epigenetics, as well as articles devoted to novel methods and techniques in endocrinology. Endocrine covers controversial clinical endocrine issues. Meta-analyses on endocrine and metabolic topics are also accepted. Descriptions of single clinical cases and/or small patient studies are not published unless of exceptional interest. However, reports of novel imaging studies and endocrine side effects in single patients may be considered. Research letters and letters to the editor, related or unrelated to recently published articles, can be submitted. Endocrine covers leading topics in endocrinology such as neuroendocrinology, pituitary and hypothalamic peptides, thyroid physiological and clinical aspects, bone and mineral metabolism and osteoporosis, obesity, lipid and energy metabolism and food intake control, insulin, Type 1 and Type 2 diabetes, hormones of male and female reproduction, adrenal diseases, pediatric and geriatric endocrinology, endocrine hypertension and endocrine oncology.