RETAIN: Interactive Tool for Regression Testing Guided LLM Migration

arXiv - CS - Information Retrieval Pub Date : 2024-09-05 DOI:arxiv-2409.03928

Tanay Dixit, Daniel Lee, Sally Fang, Sai Sree Harsha, Anirudh Sureshan, Akash Maharaj, Yunyao Li

{"title":"RETAIN: Interactive Tool for Regression Testing Guided LLM Migration","authors":"Tanay Dixit, Daniel Lee, Sally Fang, Sai Sree Harsha, Anirudh Sureshan, Akash Maharaj, Yunyao Li","doi":"arxiv-2409.03928","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) are increasingly integrated into diverse\napplications. The rapid evolution of LLMs presents opportunities for developers\nto enhance applications continuously. However, this constant adaptation can\nalso lead to performance regressions during model migrations. While several\ninteractive tools have been proposed to streamline the complexity of prompt\nengineering, few address the specific requirements of regression testing for\nLLM Migrations. To bridge this gap, we introduce RETAIN (REgression Testing\nguided LLM migrAtIoN), a tool designed explicitly for regression testing in LLM\nMigrations. RETAIN comprises two key components: an interactive interface\ntailored to regression testing needs during LLM migrations, and an error\ndiscovery module that facilitates understanding of differences in model\nbehaviors. The error discovery module generates textual descriptions of various\nerrors or differences between model outputs, providing actionable insights for\nprompt refinement. Our automatic evaluation and empirical user studies\ndemonstrate that RETAIN, when compared to manual evaluation, enabled\nparticipants to identify twice as many errors, facilitated experimentation with\n75% more prompts, and achieves 12% higher metric scores in a given time frame.","PeriodicalId":501281,"journal":{"name":"arXiv - CS - Information Retrieval","volume":"55 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.03928","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Large Language Models (LLMs) are increasingly integrated into diverse applications. The rapid evolution of LLMs presents opportunities for developers to enhance applications continuously. However, this constant adaptation can also lead to performance regressions during model migrations. While several interactive tools have been proposed to streamline the complexity of prompt engineering, few address the specific requirements of regression testing for LLM Migrations. To bridge this gap, we introduce RETAIN (REgression Testing guided LLM migrAtIoN), a tool designed explicitly for regression testing in LLM Migrations. RETAIN comprises two key components: an interactive interface tailored to regression testing needs during LLM migrations, and an error discovery module that facilitates understanding of differences in model behaviors. The error discovery module generates textual descriptions of various errors or differences between model outputs, providing actionable insights for prompt refinement. Our automatic evaluation and empirical user studies demonstrate that RETAIN, when compared to manual evaluation, enabled participants to identify twice as many errors, facilitated experimentation with 75% more prompts, and achieves 12% higher metric scores in a given time frame.

查看原文本刊更多论文

RETAIN：引导 LLM 迁移的回归测试互动工具

大型语言模型（LLM）越来越多地被集成到各种应用中。LLM 的快速发展为开发人员提供了不断增强应用的机会。然而，这种不断的适应性也会导致模型迁移过程中的性能下降。虽然已经提出了一些交互式工具来简化复杂的提示工程，但很少有工具能满足 LLM 迁移回归测试的特定要求。为了弥合这一差距，我们引入了 RETAIN（REgression Testingguided LLM migrAtIoN），这是一款专门为 LLMM 迁移中的回归测试而设计的工具。RETAIN 由两个关键部分组成：在 LLM 迁移过程中满足回归测试需求的交互式界面，以及有助于理解模型行为差异的错误发现模块。错误发现模块生成各种错误或模型输出之间差异的文本描述，为及时改进提供可操作的见解。我们的自动评估和实证用户研究表明，与人工评估相比，RETAIN 使参与者识别的错误数量增加了一倍，促进实验的提示增加了 75%，在给定时间内获得的指标分数提高了 12%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

arXiv - CS - Information Retrieval

自引率

0.00%

发文量