PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Ilya Gusev · arXiv:2409.06820 · arXiv - CS - Computation and Language · 2024-09-10
We introduce a novel benchmark for evaluating the role-playing capabilities
of language models. Our approach leverages language models themselves to
emulate users in dynamic, multi-turn conversations and to assess the resulting
dialogues. The framework consists of three main components: a player model
assuming a specific character role, an interrogator model simulating user
behavior, and a judge model evaluating conversation quality. We conducted
experiments comparing automated evaluations with human annotations to validate
our approach, demonstrating strong correlations across multiple criteria. This
work provides a foundation for a robust and dynamic evaluation of model
capabilities in interactive scenarios.
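The three-component loop described above (a player model answering in character, an interrogator model emulating the user, and a judge model scoring the finished dialogue) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Completer` type, the prompt/transcript format, and the stub lambdas are all hypothetical stand-ins for real language-model API calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A completer maps a prompt (here, the running transcript) to a model response.
Completer = Callable[[str], str]

@dataclass
class RolePlayEval:
    player: Completer        # assumes the character role
    interrogator: Completer  # simulates user behavior
    judge: Completer         # evaluates conversation quality

    def run(self, character: str, turns: int = 3) -> Tuple[List[Tuple[str, str]], str]:
        """Run a multi-turn dialogue, then score it with the judge."""
        dialogue: List[Tuple[str, str]] = []
        transcript = f"Character: {character}\n"
        for _ in range(turns):
            user_msg = self.interrogator(transcript)   # user's next message
            transcript += f"User: {user_msg}\n"
            reply = self.player(transcript)            # in-character reply
            transcript += f"{character}: {reply}\n"
            dialogue.append((user_msg, reply))
        verdict = self.judge(transcript)               # single quality judgment
        return dialogue, verdict

# Stub completers standing in for real models, for illustration only.
evaluator = RolePlayEval(
    player=lambda t: "Arr, she be the fastest ship on the seven seas!",
    interrogator=lambda t: "Tell me about your ship.",
    judge=lambda t: "in-character: 5/5",
)
dialogue, verdict = evaluator.run("Pirate Captain", turns=2)
print(len(dialogue), verdict)
```

In practice each `Completer` would wrap a chat-model API call with a role-specific system prompt, and the judge would return structured scores across multiple criteria (e.g. staying in character, fluency, engagement) rather than a single string.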