LLM4VV:探索用于验证和核查测试套件的 LLM 即法官

Zachariah Sollenberger, Jay Patel, Christian Munley, Aaron Jarmusch, Sunita Chandrasekaran
{"title":"LLM4VV:探索用于验证和核查测试套件的 LLM 即法官","authors":"Zachariah Sollenberger, Jay Patel, Christian Munley, Aaron Jarmusch, Sunita Chandrasekaran","doi":"arxiv-2408.11729","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLM) are evolving and have significantly\nrevolutionized the landscape of software development. If used well, they can\nsignificantly accelerate the software development cycle. At the same time, the\ncommunity is very cautious of the models being trained on biased or sensitive\ndata, which can lead to biased outputs along with the inadvertent release of\nconfidential information. Additionally, the carbon footprints and the\nun-explainability of these black box models continue to raise questions about\nthe usability of LLMs. With the abundance of opportunities LLMs have to offer, this paper explores\nthe idea of judging tests used to evaluate compiler implementations of\ndirective-based programming models as well as probe into the black box of LLMs.\nBased on our results, utilizing an agent-based prompting approach and setting\nup a validation pipeline structure drastically increased the quality of\nDeepSeek Coder, the LLM chosen for the evaluation purposes.","PeriodicalId":501197,"journal":{"name":"arXiv - CS - Programming Languages","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites\",\"authors\":\"Zachariah Sollenberger, Jay Patel, Christian Munley, Aaron Jarmusch, Sunita Chandrasekaran\",\"doi\":\"arxiv-2408.11729\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large Language Models (LLM) are evolving and have significantly\\nrevolutionized the landscape of software development. If used well, they can\\nsignificantly accelerate the software development cycle. At the same time, the\\ncommunity is very cautious of the models being trained on biased or sensitive\\ndata, which can lead to biased outputs along with the inadvertent release of\\nconfidential information. Additionally, the carbon footprints and the\\nun-explainability of these black box models continue to raise questions about\\nthe usability of LLMs. With the abundance of opportunities LLMs have to offer, this paper explores\\nthe idea of judging tests used to evaluate compiler implementations of\\ndirective-based programming models as well as probe into the black box of LLMs.\\nBased on our results, utilizing an agent-based prompting approach and setting\\nup a validation pipeline structure drastically increased the quality of\\nDeepSeek Coder, the LLM chosen for the evaluation purposes.\",\"PeriodicalId\":501197,\"journal\":{\"name\":\"arXiv - CS - Programming Languages\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Programming Languages\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.11729\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Programming Languages","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.11729","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

大型语言模型(LLM)不断发展,极大地改变了软件开发的格局。如果使用得当,它们可以显著加快软件开发周期。与此同时,业界也非常担心模型在有偏见或敏感数据的基础上进行训练,这可能会导致有偏见的输出以及无意中泄露机密信息。此外,这些黑盒子模型的碳足迹和无法解释性继续引发人们对 LLM 可用性的质疑。基于我们的研究结果,利用基于代理的提示方法和建立验证流水线结构大大提高了用于评估的 LLM--DeepSeek Coder 的质量。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites
Large Language Models (LLM) are evolving and have significantly revolutionized the landscape of software development. If used well, they can significantly accelerate the software development cycle. At the same time, the community is very cautious of the models being trained on biased or sensitive data, which can lead to biased outputs along with the inadvertent release of confidential information. Additionally, the carbon footprints and the un-explainability of these black box models continue to raise questions about the usability of LLMs. With the abundance of opportunities LLMs have to offer, this paper explores the idea of judging tests used to evaluate compiler implementations of directive-based programming models as well as probe into the black box of LLMs. Based on our results, utilizing an agent-based prompting approach and setting up a validation pipeline structure drastically increased the quality of DeepSeek Coder, the LLM chosen for the evaluation purposes.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信