Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task

IF 7.5 | Zone 3, Computer Science | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data | Pub Date: 2025-02-13 | DOI: 10.1109/TBDATA.2025.3536928

Zihao Wu;Lu Zhang;Chao Cao;Xiaowei Yu;Zhengliang Liu;Lin Zhao;Yiwei Li;Haixing Dai;Chong Ma;Gang Li;Wei Liu;Quanzheng Li;Dinggang Shen;Xiang Li;Dajiang Zhu;Tianming Liu

Vol. 11, No. 3, pp. 1027-1041 | https://ieeexplore.ieee.org/document/10887002/

Citation count: 0

Abstract

Recently, ChatGPT and GPT-4 have emerged and gained immense global attention due to their unparalleled performance in language processing. Despite demonstrating impressive capability in various open-domain tasks, their adequacy in highly specific fields like radiology remains untested. Radiology presents unique linguistic phenomena distinct from open-domain data due to its specificity and complexity. Assessing the performance of large language models (LLMs) in such specific domains is crucial not only for a thorough evaluation of their overall performance but also for providing valuable insights into future model design directions: whether model design should be generic or domain-specific. To this end, in this study, we evaluate the performance of ChatGPT/GPT-4 on a radiology natural language inference (NLI) task and compare it to other models fine-tuned specifically on task-related data samples. We also conduct a comprehensive investigation of ChatGPT/GPT-4’s reasoning ability by introducing varying levels of inference difficulty. Our results show that 1) ChatGPT and GPT-4 outperform other LLMs in the radiology NLI task and 2) other specifically fine-tuned BERT-based models require significant amounts of data samples to achieve performance comparable to ChatGPT/GPT-4. These findings not only demonstrate the feasibility and promise of constructing a generic model capable of addressing various tasks across different domains, but also highlight several key factors crucial for developing a unified model, particularly in a medical context, paving the way for future artificial general intelligence (AGI) systems. We release our code and data to the research community.




Source journal

IEEE Transactions on Big Data

CiteScore

11.80

Self-citation rate

2.80%

Articles published

114

Journal description: The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.