PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Ilya Gusev · arXiv:2409.06820 · arXiv - CS - Computation and Language · 2024-09-10
We introduce a novel benchmark for evaluating the role-playing capabilities
of language models. Our approach leverages language models themselves to
emulate users in dynamic, multi-turn conversations and to assess the resulting
dialogues. The framework consists of three main components: a player model
assuming a specific character role, an interrogator model simulating user
behavior, and a judge model evaluating conversation quality. We conducted
experiments comparing automated evaluations with human annotations to validate
our approach, demonstrating strong correlations across multiple criteria. This
work provides a foundation for a robust and dynamic evaluation of model
capabilities in interactive scenarios.
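The three-component loop described above (a player model answering in character, an interrogator model emulating the user, and a judge model scoring the finished dialogue) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Completer` type, the prompt/transcript format, and the stub lambdas are all hypothetical stand-ins for real language-model API calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A completer maps a prompt (here, the running transcript) to a model response.
Completer = Callable[[str], str]

@dataclass
class RolePlayEval:
    player: Completer        # assumes the character role
    interrogator: Completer  # simulates user behavior
    judge: Completer         # evaluates conversation quality

    def run(self, character: str, turns: int = 3) -> Tuple[List[Tuple[str, str]], str]:
        """Run a multi-turn dialogue, then score it with the judge."""
        dialogue: List[Tuple[str, str]] = []
        transcript = f"Character: {character}\n"
        for _ in range(turns):
            user_msg = self.interrogator(transcript)   # user's next message
            transcript += f"User: {user_msg}\n"
            reply = self.player(transcript)            # in-character reply
            transcript += f"{character}: {reply}\n"
            dialogue.append((user_msg, reply))
        verdict = self.judge(transcript)               # single quality judgment
        return dialogue, verdict

# Stub completers standing in for real models, for illustration only.
evaluator = RolePlayEval(
    player=lambda t: "Arr, she be the fastest ship on the seven seas!",
    interrogator=lambda t: "Tell me about your ship.",
    judge=lambda t: "in-character: 5/5",
)
dialogue, verdict = evaluator.run("Pirate Captain", turns=2)
print(len(dialogue), verdict)
```

In practice each `Completer` would wrap a chat-model API call with a role-specific system prompt, and the judge would return structured scores across multiple criteria (e.g. staying in character, fluency, engagement) rather than a single string.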