A Multi-Modal Assessment Framework for Comparison of Specialized Deep Learning and General-Purpose Large Language Models

IF 5.7; Zone 3, Computer Science; JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data Pub Date : 2025-01-30 DOI:10.1109/TBDATA.2025.3536937

Mohammad Nadeem;Shahab Saquib Sohail;Dag Øivind Madsen;Ahmed Ibrahim Alzahrani;Javier Del Ser;Khan Muhammad

Volume 11, Issue 3, pp. 1001-1012. https://ieeexplore.ieee.org/document/10858454/

Citations: 0

Abstract

Recent years have witnessed tremendous advancements in AI tools (e.g., ChatGPT, GPT-4, and Bard), driven by the growing power, reasoning, and efficiency of Large Language Models (LLMs). LLMs have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. Despite their proficiency in general queries, specialized tasks such as metaphor understanding and fake news detection often require finely tuned models, posing a comparison challenge with specialized Deep Learning (DL). We propose an assessment framework to compare task-specific intelligence with general-purpose LLMs on suicide and depression tendency identification. For this purpose, we trained two DL models on a suicide and depression detection dataset, followed by testing their performance on a test set. Afterward, the same test dataset is used to evaluate the performance of four LLMs (GPT-3.5, GPT-4, Google Bard, and MS Bing) using four classification metrics. The BERT-based DL model performed the best among all, with a testing accuracy of 94.61%, while GPT-4 was the runner-up with an accuracy of 92.5%. Results demonstrate that LLMs do not outperform the specialized DL models but are able to achieve comparable performance, making them a decent option for downstream tasks without specialized training. However, LLMs outperformed specialized models on the reduced dataset.
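The comparison described in the abstract, scoring every model on the same test set with the same metrics, can be illustrated with a minimal sketch. The abstract does not name the four classification metrics; accuracy, precision, recall, and F1 are assumed here. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive class).

    Assumption: these are the four metrics; the abstract does not list them.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix counts
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only; these are NOT results from the paper.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 0],  # hypothetical fine-tuned classifier
    "model_b": [1, 1, 1, 1, 0, 0, 0, 0],  # hypothetical prompted LLM
}

# Score every model against the same held-out labels.
scores = {name: binary_metrics(y_true, pred) for name, pred in predictions.items()}
for name, m in scores.items():
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Holding the test labels fixed and varying only the predictions is what makes the resulting numbers comparable across a trained classifier and a prompted LLM.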

Source journal

IEEE Transactions on Big Data

CiteScore

11.80

Self-citation rate

2.80%

Articles published

114

Journal description: The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.