How Good Is It? Evaluating the Efficacy of Common versus Domain-Specific Prompts on Foundational Large Language Models
Oluyemi Enoch Amujo, Shanchieh Jay Yang
arXiv - CS - Performance, arXiv:2407.11006, 2024-06-25
Recently, large language models (LLMs) have expanded into various domains.
However, there remains a need to evaluate how these models perform when
prompted with commonplace queries compared with domain-specific queries; such an
evaluation may be useful for benchmarking prior to fine-tuning on domain-specific
downstream tasks. This study evaluates LLMs, specifically Gemma-2B and Gemma-7B, across
diverse domains, including cybersecurity, medicine, and finance, compared to
common-knowledge queries. It employs a comprehensive methodology to
evaluate foundational models, encompassing problem formulation, data analysis,
and the development of novel outlier detection techniques. This methodological
rigor enhances the credibility of the presented evaluation frameworks. The
study assesses inference time, response length, throughput,
quality, and resource utilization, and investigates the correlations among
these factors. The results indicate that model size and the type of prompt used
for inference significantly influence response length and quality. In
addition, common prompts, which include various types of queries, elicit
diverse and inconsistent responses at irregular intervals. In contrast,
domain-specific prompts consistently elicit concise responses within a
reasonable time. Overall, this study underscores the need for comprehensive
evaluation frameworks to enhance the reliability of benchmarking procedures in
multidomain AI research.
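
As a rough illustration of the kind of measurements the abstract lists, the sketch below derives throughput from response length and inference time, computes a Pearson correlation between two metrics, and flags outliers. It is a minimal sketch under stated assumptions, not the authors' method: the abstract does not describe the paper's novel outlier detection techniques, so a standard interquartile-range (IQR) rule is swapped in, and the function names and sample numbers are made up for illustration.

```python
from math import sqrt
from statistics import mean, quantiles


def throughput(tokens: int, seconds: float) -> float:
    """Tokens produced per second of inference time (assumed definition)."""
    return tokens / seconds


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def iqr_outliers(values, k=1.5):
    """Standard IQR rule, used here as a stand-in for the paper's technique."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical measurements: response lengths (tokens) and inference times (s).
lengths = [40, 55, 60, 48, 52, 300]
times = [1.0, 1.4, 1.5, 1.2, 1.3, 7.5]

rates = [throughput(n, t) for n, t in zip(lengths, times)]
print("throughput (tokens/s):", [round(r, 2) for r in rates])
print("length vs. time correlation:", round(pearson(lengths, times), 4))
print("length outliers:", iqr_outliers(lengths))
```

Running the same three steps separately for common and domain-specific prompt sets would give a simple side-by-side view of the consistency the abstract describes.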