LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites

Zachariah Sollenberger, Jay Patel, Christian Munley, Aaron Jarmusch, Sunita Chandrasekaran
{"title":"LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites","authors":"Zachariah Sollenberger, Jay Patel, Christian Munley, Aaron Jarmusch, Sunita Chandrasekaran","doi":"arxiv-2408.11729","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLM) are evolving and have significantly\nrevolutionized the landscape of software development. If used well, they can\nsignificantly accelerate the software development cycle. At the same time, the\ncommunity is very cautious of the models being trained on biased or sensitive\ndata, which can lead to biased outputs along with the inadvertent release of\nconfidential information. Additionally, the carbon footprints and the\nun-explainability of these black box models continue to raise questions about\nthe usability of LLMs. With the abundance of opportunities LLMs have to offer, this paper explores\nthe idea of judging tests used to evaluate compiler implementations of\ndirective-based programming models as well as probe into the black box of LLMs.\nBased on our results, utilizing an agent-based prompting approach and setting\nup a validation pipeline structure drastically increased the quality of\nDeepSeek Coder, the LLM chosen for the evaluation purposes.","PeriodicalId":501197,"journal":{"name":"arXiv - CS - Programming Languages","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Programming Languages","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.11729","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Large Language Models (LLMs) are evolving rapidly and have revolutionized the landscape of software development. Used well, they can significantly accelerate the software development cycle. At the same time, the community remains cautious about models being trained on biased or sensitive data, which can lead to biased outputs and the inadvertent release of confidential information. Additionally, the carbon footprint and limited explainability of these black-box models continue to raise questions about the usability of LLMs. Given the abundance of opportunities LLMs have to offer, this paper explores the idea of using an LLM to judge tests that evaluate compiler implementations of directive-based programming models, while also probing into the black box of the LLM itself. Based on our results, utilizing an agent-based prompting approach and setting up a validation pipeline structure drastically improved the quality of the outputs of DeepSeek Coder, the LLM chosen for evaluation.
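As a rough illustration of what an agent-based judging pipeline of this kind can look like, the Python sketch below first compiles and runs a candidate test to gather objective evidence, then asks an LLM judge for a verdict on the test. This is not the authors' implementation: the model identifier, prompt wording, compiler invocation (nvc -acc), endpoint, and file names are all assumptions made for illustration.

# A minimal sketch (not the authors' implementation) of an LLM-as-a-judge
# validation pipeline for compiler test suites, as described in the abstract.
# The model name, prompt wording, and compile/run commands are assumptions.
import subprocess
from openai import OpenAI  # any OpenAI-compatible endpoint serving DeepSeek Coder

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # hypothetical local server

JUDGE_PROMPT = """You are reviewing a compiler validation test for a directive-based
programming model (e.g. OpenACC or OpenMP). Given the test source and its compile/run
results, answer VALID if the test correctly exercises the target feature and
INVALID otherwise, followed by a one-sentence justification."""

def compile_and_run(src_path: str) -> dict:
    """Stage 1 of the pipeline: gather objective signals before asking the judge."""
    build = subprocess.run(["nvc", "-acc", src_path, "-o", "test.bin"],
                           capture_output=True, text=True)
    if build.returncode != 0:
        return {"compiled": False, "ran": False, "log": build.stderr}
    run = subprocess.run(["./test.bin"], capture_output=True, text=True, timeout=60)
    return {"compiled": True, "ran": run.returncode == 0, "log": run.stdout + run.stderr}

def judge(src_path: str) -> str:
    """Stage 2: agent-style prompt combining the test source with pipeline evidence."""
    evidence = compile_and_run(src_path)
    with open(src_path) as f:
        source = f.read()
    reply = client.chat.completions.create(
        model="deepseek-coder-33b-instruct",  # assumed model identifier
        messages=[{"role": "system", "content": JUDGE_PROMPT},
                  {"role": "user", "content": f"Test source:\n{source}\n\nEvidence: {evidence}"}],
        temperature=0.0,
    )
    return reply.choices[0].message.content

if __name__ == "__main__":
    print(judge("acc_parallel_reduction.c"))  # hypothetical test file

The two-stage structure reflects the pipeline idea in the abstract: deterministic compile-and-run evidence is collected first, so the LLM judgment is grounded in observed behavior rather than the source code alone.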