Evaluation of open and closed-source LLMs for low-resource language with zero-shot, few-shot, and chain-of-thought prompting

Zabir Al Nazi , Md. Rajib Hossain , Faisal Al Mamun
Natural Language Processing Journal, Volume 10, Article 100124. Published 2025-01-03. DOI: 10.1016/j.nlp.2024.100124. Article URL: https://www.sciencedirect.com/science/article/pii/S2949719124000724
Cited by: 0

Abstract

As the global deployment of Large Language Models (LLMs) increases, the demand for multilingual capabilities becomes increasingly important. While many LLMs excel in real-time applications for high-resource languages, few are tailored specifically for low-resource languages. The limited availability of text corpora for low-resource languages, coupled with their minimal utilization during LLM training, hampers the models' ability to perform effectively in real-time applications. Additionally, evaluations of LLMs are significantly less extensive for low-resource languages. This study offers a comprehensive evaluation of both open-source and closed-source multilingual LLMs on a low-resource language, Bengali, which remains notably underrepresented in computational linguistics. Despite the limited number of models pre-trained exclusively on Bengali, we assess the performance of six prominent LLMs, i.e., three closed-source (GPT-3.5, GPT-4o, Gemini) and three open-source (Aya 101, BLOOM, LLaMA), across key natural language processing (NLP) tasks, including text classification, sentiment analysis, summarization, and question answering. These tasks were evaluated using three prompting techniques: Zero-Shot, Few-Shot, and Chain-of-Thought (CoT). This study found that the default hyperparameters of these pre-trained models, such as temperature, maximum token limit, and the number of few-shot examples, did not yield optimal outcomes and led to hallucination issues in many instances. To address these challenges, ablation studies were conducted on key hyperparameters, particularly temperature and the number of shots, to optimize few-shot learning and enhance model performance.
The focus of this research is on understanding how these LLMs adapt to low-resource downstream tasks, emphasizing their linguistic flexibility and contextual understanding. Experimental results demonstrated that the closed-source GPT-4o model, using Few-Shot learning and Chain-of-Thought prompting, achieved the highest performance across multiple tasks: an F1 score of 84.54% for text classification, 99.00% for sentiment analysis, a BERTScore F1 of 72.87% for summarization, and 58.22% for question answering. For transparency and reproducibility, all methodologies and code from this study are available in our GitHub repository: https://github.com/zabir-nabil/bangla-multilingual-llm-eval.
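The three prompting strategies compared in the abstract can be sketched as plain prompt templates. This is a minimal illustrative sketch, assuming a sentiment-analysis task; the prompt wording and the Bengali few-shot examples are assumptions for illustration, not the authors' actual templates from the repository.

```python
# Sketch of the three prompting strategies evaluated in the study,
# illustrated for Bengali sentiment analysis. Prompt wording and the
# example sentences are illustrative, not the paper's exact templates.

def zero_shot(text: str) -> str:
    # Zero-Shot: task instruction only, no labeled examples.
    return (
        "Classify the sentiment of the following Bengali text as "
        f"Positive or Negative.\nText: {text}\nSentiment:"
    )

def few_shot(text: str, examples: list[tuple[str, str]]) -> str:
    # Few-Shot: prepend labeled demonstrations; the paper ablates the
    # number of such "shots" alongside temperature.
    demos = "\n".join(f"Text: {t}\nSentiment: {s}" for t, s in examples)
    return (
        "Classify the sentiment of Bengali text as Positive or Negative.\n"
        f"{demos}\nText: {text}\nSentiment:"
    )

def chain_of_thought(text: str) -> str:
    # Chain-of-Thought: ask the model to reason step by step
    # before committing to a label.
    return (
        "Classify the sentiment of the following Bengali text as "
        "Positive or Negative. Think step by step, then give the label.\n"
        f"Text: {text}\nReasoning:"
    )

if __name__ == "__main__":
    shots = [("খাবারটা দারুণ ছিল!", "Positive"), ("সেবা খুব খারাপ।", "Negative")]
    print(few_shot("সিনেমাটা ভালো লেগেছে।", shots))
```

In practice each template would be sent to the model under test (e.g., via the GPT or Gemini APIs) with the temperature and shot count chosen by the ablation study.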