{"title":"How Good Is It? Evaluating the Efficacy of Common versus Domain-Specific Prompts on Foundational Large Language Models","authors":"Oluyemi Enoch Amujo, Shanchieh Jay Yang","doi":"arxiv-2407.11006","DOIUrl":null,"url":null,"abstract":"Recently, large language models (LLMs) have expanded into various domains.\nHowever, there remains a need to evaluate how these models perform when\nprompted with commonplace queries compared to domain-specific queries, which\nmay be useful for benchmarking prior to fine-tuning domain-specific downstream\ntasks. This study evaluates LLMs, specifically Gemma-2B and Gemma-7B, across\ndiverse domains, including cybersecurity, medicine, and finance, compared to\ncommon knowledge queries. This study employs a comprehensive methodology to\nevaluate foundational models, encompassing problem formulation, data analysis,\nand the development of novel outlier detection techniques. This methodological\nrigor enhances the credibility of the presented evaluation frameworks. This\nstudy focused on assessing inference time, response length, throughput,\nquality, and resource utilization and investigated the correlations between\nthese factors. The results indicate that model size and types of prompts used\nfor inference significantly influenced response length and quality. In\naddition, common prompts, which include various types of queries, generate\ndiverse and inconsistent responses at irregular intervals. In contrast,\ndomain-specific prompts consistently generate concise responses within a\nreasonable time. Overall, this study underscores the need for comprehensive\nevaluation frameworks to enhance the reliability of benchmarking procedures in\nmultidomain AI research.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"46 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.11006","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Large language models (LLMs) have recently expanded into many domains, yet there remains a need to evaluate how they perform on commonplace queries compared with domain-specific queries, which can serve as a benchmark prior to fine-tuning for domain-specific downstream tasks. This study evaluates two foundational LLMs, Gemma-2B and Gemma-7B, on domain-specific prompts drawn from cybersecurity, medicine, and finance, and compares their behavior against common-knowledge prompts. The evaluation follows a comprehensive methodology encompassing problem formulation, data analysis, and the development of novel outlier detection techniques, which strengthens the credibility of the presented evaluation framework. The study assesses inference time, response length, throughput, response quality, and resource utilization, and investigates the correlations among these factors. The results indicate that model size and the type of prompt used for inference significantly influence response length and quality. In addition, common prompts, which span varied query types, produce diverse and inconsistent responses with irregular response times, whereas domain-specific prompts consistently yield concise responses within a reasonable time. Overall, the study underscores the need for comprehensive evaluation frameworks to improve the reliability of benchmarking procedures in multi-domain AI research.
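
The abstract does not include the authors' measurement code; the following is a minimal sketch, assuming Hugging Face `transformers`, `torch`, and `pandas`, of how per-prompt inference time, response length, and throughput could be collected for a Gemma model and how correlations between those metrics might then be computed. The model identifier, prompt lists, and generation settings are illustrative placeholders, not the paper's actual setup.

```python
# Sketch: collect latency/length/throughput per prompt for a Gemma model and
# correlate the metrics. Not the authors' implementation; placeholders are noted.
import time

import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # assumption: "google/gemma-7b" would be swapped in for the larger model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder prompts standing in for the common vs. domain-specific query sets.
prompts = {
    "common": ["What causes rainbows?"],
    "cybersecurity": ["Explain how a SQL injection attack works."],
}

records = []
for domain, queries in prompts.items():
    for query in queries:
        inputs = tokenizer(query, return_tensors="pt").to(model.device)
        start = time.perf_counter()
        output = model.generate(**inputs, max_new_tokens=256)
        elapsed = time.perf_counter() - start
        new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
        records.append(
            {
                "domain": domain,
                "inference_time_s": elapsed,
                "response_length_tokens": new_tokens,
                "throughput_tokens_per_s": new_tokens / elapsed,
            }
        )

df = pd.DataFrame(records)
# Pairwise correlations among the numeric metrics, e.g. inference time vs. response length.
print(df[["inference_time_s", "response_length_tokens", "throughput_tokens_per_s"]].corr())
```

Response quality, resource utilization, and the outlier detection techniques mentioned in the abstract are outside the scope of this sketch and would require the paper's own scoring and monitoring procedures.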