SimulBench: Evaluating Language Models with Creative Simulation Tasks
Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin
arXiv:2409.07641 (arXiv - CS - Computation and Language, 2024-09-11)
Abstract
We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation tasks are effective measures of an LLM's general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework that tests different LLMs fairly while preserving the multi-round, interactive nature of simulation tasks between users and AI. To tackle this issue, we first use a fixed LLM as a user agent to interact with an LLM and collect dialogues under the different tasks. Challenging dialogue scripts are then extracted for evaluating different target LLMs. To enable automatic assessment on SimulBench, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the final response generated by each target LLM given a multi-turn dialogue script. Our comprehensive experiments indicate that these simulation tasks, with their unique natures, continue to pose a significant challenge and reveal the gap between proprietary models and the most advanced open LLMs. For example, GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55% more cases.
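To make the two-stage protocol described in the abstract concrete, below is a minimal illustrative sketch of how a pre-collected dialogue script might be replayed to a target LLM and then scored by a GPT-4 judge. The `chat` helper, the judge prompt, and all model names are assumptions for illustration; this is not the authors' released evaluation code.

```python
# Hypothetical sketch of a SimulBench-style evaluation step:
# a target LLM completes the final turn of a fixed multi-turn script,
# and GPT-4 judges only that final response.
from openai import OpenAI

client = OpenAI()


def chat(model: str, messages: list[dict]) -> str:
    """Single chat-completion call; returns the assistant's reply text."""
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content


def evaluate_on_script(target_model: str, task_prompt: str, script: list[dict]) -> str:
    """Replay a fixed dialogue script to the target model, then judge its final response."""
    # 1) The target model continues the pre-collected script
    #    (e.g. a Linux-terminal simulation task).
    final_response = chat(
        target_model,
        [{"role": "system", "content": task_prompt}] + script,
    )

    # 2) GPT-4 acts as the evaluator of the final response only,
    #    given the task and the dialogue history (judge prompt is illustrative).
    judge_prompt = (
        "You are grading a creative simulation task.\n"
        f"Task: {task_prompt}\n"
        f"Dialogue so far: {script}\n"
        f"Model's final response: {final_response}\n"
        "Rate how correct and faithful the response is to the simulation."
    )
    return chat("gpt-4", [{"role": "user", "content": judge_prompt}])
```

Because every target LLM answers the same fixed scripts and only the final turn is judged, the comparison stays fair while the multi-round interactive context is preserved.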