Systematic comparison of GPT models for the analysis of pathology reports in a low-resource language: A case study for Turkish.

Impact factor: 1.9; Zone 4 (Medicine); JCR Q2 (Pathology)
Omer Faruk Dilbaz, Muhammet Nusret Ozates, Beyza Bolat, Cigdem Gunduz-Demir, Ibrahim Kulac
Journal: American Journal of Clinical Pathology
Publication type: Journal Article
Published: 2025-09-16
DOI: 10.1093/ajcp/aqaf091 (https://doi.org/10.1093/ajcp/aqaf091)
Citations: 0

Abstract

Objective: Large language models (LLMs) can process text for various applications, including surgical pathology reports, but studies have focused primarily on English. Their performance has not been systematically studied in a low-resource language. To analyze the performance of several LLMs, we selected 759 Turkish pathology reports from 5 different procedures.

Methods: We used 10 examples from each procedure to optimize prompts for OpenAI's GPT-3.5 Turbo, GPT-4o mini, and GPT-4o. The remaining reports were used to test generalizability.
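The split described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not say how the 10 examples per procedure were chosen, so random selection, the `split_by_procedure` helper, the `procedure` field, and the placeholder procedure labels are all assumptions.

```python
import random
from collections import defaultdict


def split_by_procedure(reports, n_dev=10, seed=0):
    """Hold out n_dev reports per procedure for prompt tuning.

    Returns (dev, test): dev holds n_dev reports from each procedure,
    test holds every remaining report. Random selection is an assumption;
    the abstract does not state how the examples were picked.
    """
    rng = random.Random(seed)

    # Group reports by their (hypothetical) "procedure" field.
    by_procedure = defaultdict(list)
    for report in reports:
        by_procedure[report["procedure"]].append(report)

    dev, test = [], []
    for procedure in sorted(by_procedure):
        items = list(by_procedure[procedure])
        if len(items) < n_dev:
            raise ValueError(f"{procedure!r} has fewer than {n_dev} reports")
        rng.shuffle(items)
        dev.extend(items[:n_dev])
        test.extend(items[n_dev:])
    return dev, test


# Synthetic placeholder data: 759 reports spread over 5 made-up procedure
# labels. The real per-procedure counts are not given in the abstract.
reports = [{"id": i, "procedure": f"procedure_{i % 5}"} for i in range(759)]
dev, test = split_by_procedure(reports)
print(len(dev), len(test))  # 50 709
```

With 5 procedures and 10 examples each, 50 reports go to prompt optimization and the remaining 709 form the test set, whatever the per-procedure distribution is.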

Results: GPT-4o performed best on the Turkish reports (12%-25% higher than GPT-3.5 Turbo, 3%-16% higher than GPT-4o mini). Translating the reports into English improved accuracy, especially for GPT-3.5 Turbo and GPT-4o mini. GPT-4o showed comparable results for Turkish and English. A 12% to 22% performance gap was observed between GPT-4o and GPT-3.5 Turbo on the English-translated reports. Domain-related tips in the prompts increased accuracy. For all models, results on the larger test sets paralleled those on the validation set. GPT-4o yielded the most accurate results, GPT-4o mini showed intermediate performance, and GPT-3.5 Turbo was the least accurate.

Conclusions: To our knowledge, this is the first demonstration in the literature of the performance of GPT models on Turkish surgical pathology reports. The results indicate that data extracted by GPT-4o are almost ready for direct application.

Source journal
CiteScore: 7.70
Self-citation rate: 2.90%
Articles published: 367
Review time: 3-6 weeks
About the journal: The American Journal of Clinical Pathology (AJCP) is the official journal of the American Society for Clinical Pathology and the Academy of Clinical Laboratory Physicians and Scientists. It is a leading international journal for publication of articles concerning novel anatomic pathology and laboratory medicine observations on human disease. AJCP emphasizes articles that focus on the application of evolving technologies for the diagnosis and characterization of diseases and conditions, as well as those that have a direct link toward improving patient care.