Towards Optimal NLP Solutions: Analyzing GPT and LLaMA-2 Models Across Model Scale, Dataset Size, and Task Diversity

Ankit Kumar, Richa Sharma, Punam Bedi
{"title":"Towards Optimal NLP Solutions: Analyzing GPT and LLaMA-2 Models Across Model Scale, Dataset Size, and Task Diversity","authors":"Ankit Kumar, Richa Sharma, Punam Bedi","doi":"10.48084/etasr.7200","DOIUrl":null,"url":null,"abstract":"This study carries out a comprehensive comparison of fine-tuned GPT models (GPT-2, GPT-3, GPT-3.5) and LLaMA-2 models (LLaMA-2 7B, LLaMA-2 13B, LLaMA-2 70B) in text classification, addressing dataset sizes, model scales, and task diversity. Since its inception in 2018, the GPT series has been pivotal in advancing NLP, with each iteration introducing substantial enhancements. Despite its progress, detailed analyses, especially against competitive open-source models like the LLaMA-2 series in text classification, remain scarce. The current study fills this gap by fine-tuning these models across varied datasets, focusing on enhancing task-specific performance in hate speech and offensive language detection, fake news classification, and sentiment analysis. The learning efficacy and efficiency of the GPT and LLaMA-2 models were evaluated, providing a nuanced guide to choosing optimal models for NLP tasks based on architectural benefits and adaptation efficiency with limited data and resources. In particular, even with datasets as small as 1,000 rows per class, the F1 scores for the GPT-3.5 and LLaMA-2 models exceeded 0.9, reaching 0.99 with complete datasets. Additionally, the LLaMA-2 13B and 70B models outperformed GPT-3, demonstrating their superior efficiency and effectiveness in text classification. Both the GPT and LLaMA-2 series showed commendable performance on all three tasks, underscoring their ability to handle a diversity of tasks. Based on the size, performance, and resources required for fine-tuning the model, this study identifies LLaMA-2 13B as the most optimal model for NLP tasks.","PeriodicalId":364936,"journal":{"name":"Engineering, Technology & Applied Science Research","volume":"131 30","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering, Technology & Applied Science Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48084/etasr.7200","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

This study presents a comprehensive comparison of fine-tuned GPT models (GPT-2, GPT-3, GPT-3.5) and LLaMA-2 models (LLaMA-2 7B, LLaMA-2 13B, LLaMA-2 70B) on text classification, examining the effects of dataset size, model scale, and task diversity. Since its inception in 2018, the GPT series has been pivotal in advancing NLP, with each iteration introducing substantial enhancements. Despite this progress, detailed analyses, especially comparisons against competitive open-source models such as the LLaMA-2 series on text classification, remain scarce. The current study fills this gap by fine-tuning these models on varied datasets, focusing on task-specific performance in hate speech and offensive language detection, fake news classification, and sentiment analysis. The learning efficacy and efficiency of the GPT and LLaMA-2 models were evaluated, yielding a nuanced guide to selecting models for NLP tasks based on architectural strengths and adaptation efficiency under limited data and compute. In particular, even with datasets as small as 1,000 rows per class, the F1 scores of the GPT-3.5 and LLaMA-2 models exceeded 0.9, reaching 0.99 on the complete datasets. Additionally, the LLaMA-2 13B and 70B models outperformed GPT-3, demonstrating superior efficiency and effectiveness in text classification. Both the GPT and LLaMA-2 series performed well on all three tasks, underscoring their ability to handle diverse tasks. Weighing model size, performance, and the resources required for fine-tuning, this study identifies LLaMA-2 13B as the optimal model for these NLP tasks.
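For readers who want to set up a comparable experiment, the sketch below shows one common way to fine-tune a LLaMA-2 checkpoint for binary text classification and report a macro F1 score, mirroring the evaluation metric used in the study. It is illustrative only: the paper does not publish its training code, and the checkpoint name, the tweet_eval hate speech dataset, the LoRA configuration, and all hyperparameters here are placeholder assumptions, not the authors' actual settings.

```python
# Minimal sketch: LoRA fine-tuning of LLaMA-2 for text classification.
# Assumes access to the gated meta-llama weights on the Hugging Face Hub;
# on limited hardware, swap in Llama-2-7b-hf or add 4-bit quantization.
import numpy as np
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "meta-llama/Llama-2-13b-hf"  # assumption: the 13B base checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-2 ships without a pad token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# Placeholder corpus for the hate speech task; the study's own datasets
# are not bundled here. To mimic its small-data setting, one could
# subsample roughly 1,000 rows per class from the train split.
dataset = load_dataset("tweet_eval", "hate").map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# LoRA freezes the base model and trains small adapter matrices, one
# plausible way to adapt a 13B model under the resource constraints the
# paper emphasizes; the rank and alpha below are arbitrary examples.
model = get_peft_model(model, LoraConfig(task_type=TaskType.SEQ_CLS,
                                         r=8, lora_alpha=16))

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-cls", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=2e-4),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables default padding collation
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # reports macro F1 on the held-out split
```

Full fine-tuning would follow the same Trainer pattern with the get_peft_model call removed, at a much higher memory cost; adapter-style tuning is the design choice most consistent with the paper's focus on adaptation efficiency with limited data and resources.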