Vision-Enabled Large Language and Deep Learning Models for Image-Based Emotion Recognition

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Cognitive Computation | Pub Date: 2024-05-27 | DOI: 10.1007/s12559-024-10281-5

Mohammad Nadeem, Shahab Saquib Sohail, Laeeba Javed, Faisal Anwer, Abdul Khader Jilani Saudagar, Khan Muhammad

Citations: 0

Abstract

Artificial intelligence (AI)-based tools and systems have advanced significantly in capability, reasoning, and efficiency. Noteworthy examples include generative AI-based large language models (LLMs) such as generative pretrained transformer 3.5 (GPT-3.5), generative pretrained transformer 4 (GPT-4), and Bard. LLMs are versatile and effective for tasks such as composing poetry, writing code, generating essays, and solving puzzles. Until recently, LLMs could effectively process only text-based input. Recent advancements, however, have enabled them to handle multimodal inputs such as text, images, and audio, making them highly general-purpose tools. Because LLMs have achieved decent performance in pattern recognition tasks (such as classification), it is natural to ask whether general-purpose LLMs can perform comparably to, or even better than, specialized deep learning models (DLMs) trained specifically for a given task. In this study, we compared the performance of fine-tuned DLMs with that of general-purpose LLMs for image-based emotion recognition. We trained the DLMs, namely two convolutional neural network (CNN) models (\(CNN_1\) and \(CNN_2\)), ResNet50, and VGG-16, on an image dataset for emotion recognition and then tested them on another dataset. Subsequently, we evaluated two vision-enabled LLMs (LLaVA and GPT-4) on the same test dataset. \(CNN_2\) was the best model, with an accuracy of 62%, whereas VGG-16 produced the lowest accuracy, 31%. Among the LLMs, GPT-4 performed best, with an accuracy of 55.81%. LLaVA achieved higher accuracy than \(CNN_1\) and VGG-16. The other performance metrics, such as precision, recall, and F1-score, followed similar trends. However, GPT-4 performed the best with small datasets. The weaker results of the LLMs can be attributed to their general-purpose nature: despite extensive pretraining, they may not capture the features required for a specific task, such as emotion recognition in images, as effectively as models fine-tuned for that task. The LLMs did not surpass the specialized models but achieved comparable performance, making them a viable option for specific tasks without additional training. In addition, LLMs can be considered a good alternative when the available dataset is small.
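The abstract compares models by accuracy, precision, recall, and F1-score. As a minimal sketch of how such metrics can be computed from predicted and true emotion labels, the Python below uses macro averaging over classes. The averaging scheme, the function name `classification_metrics`, and the toy labels are assumptions for illustration only; the abstract does not state how the authors aggregated per-class scores, and the toy data are not from the study.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging (an assumption here) gives every class equal weight.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrong
            fn[truth] += 1   # true class was missed

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical toy labels, purely illustrative.
y_true = ["happy", "happy", "sad", "sad", "angry", "angry"]
y_pred = ["happy", "sad", "sad", "sad", "angry", "happy"]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
```

The same function could be applied to the outputs of any of the compared models, provided each model's prediction is first mapped to one label from a shared emotion label set.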

Source journal

Cognitive Computation COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-NEUROSCIENCES

CiteScore: 9.30

Self-citation rate: 3.70%

Articles published: 116

Review time: >12 weeks

Journal description: Cognitive Computation is an international, peer-reviewed, interdisciplinary journal that publishes cutting-edge articles describing original basic and applied work involving biologically inspired computational accounts of all aspects of natural and artificial cognitive systems. It provides a new platform for the dissemination of research, current practices, and future trends in the emerging discipline of cognitive computation, which bridges the gap between life sciences, social sciences, engineering, physical and mathematical sciences, and humanities.