BenchING: A Benchmark for Evaluating Large Language Models in Following Structured Output Format Instruction in Text-Based Narrative Game Tasks

Pittawat Taveekitworachai; Mury F. Dewantoro; Yi Xia; Pratch Suntichaikul; Ruck Thawonmas

IEEE Transactions on Games, vol. 17, no. 3, pp. 665-675. Published 14 January 2025. DOI: 10.1109/TG.2025.3529117. Available at: https://ieeexplore.ieee.org/document/10840256/
In this article, we present BenchING, a new benchmark for evaluating large language models (LLMs) on their ability to follow structured output format instructions in text-based procedural content generation (PCG) tasks. The ability to condition LLMs to output in specified formats is useful because downstream components in LLM-integrated games often require structured outputs for exchanging information. However, there is a gap in evaluating this aspect of LLMs, especially in narrative PCG tasks, which makes it difficult to select LLMs and to design games or applications that integrate them. To demonstrate the potential of our benchmark, we evaluate nine LLMs on their ability to generate parseable formatted outputs across five selected text-based PCG tasks and report their performance. In addition, we categorize the errors into more detailed types and propose solutions that utilize LLMs to fix them. We also conduct a scaling study investigating an emergence point in LLMs' ability to fix malformed formatted content, using eight quantized LLMs with original sizes ranging from 0.62 B to 72.3 B parameters. Furthermore, we perform a qualitative study to assess the quality of the generated content. We make our source code and raw data available for future research.
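To make the evaluation setting concrete, the sketch below illustrates the kind of parseability check and LLM-based repair loop the abstract describes. It is not BenchING's actual implementation: it assumes JSON as the target structured format and treats the repair model as an injected callable (`repair_fn`, a hypothetical interface standing in for a prompt to an LLM asked to fix malformed output).

```python
import json
from typing import Callable, Optional

def check_parseable(output: str) -> Optional[dict]:
    """Return the parsed object if `output` is valid JSON, else None."""
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        return None

def evaluate_with_repair(
    output: str,
    repair_fn: Callable[[str], str],
    max_attempts: int = 1,
) -> tuple[bool, Optional[dict]]:
    """Check parseability; on failure, ask a repair model to fix the output.

    `repair_fn` is a placeholder for an LLM call prompted to correct
    malformed formatted content (hypothetical interface, not the paper's API).
    """
    parsed = check_parseable(output)
    if parsed is not None:
        return True, parsed          # well-formed on the first try
    for _ in range(max_attempts):
        output = repair_fn(output)   # LLM-based repair attempt
        parsed = check_parseable(output)
        if parsed is not None:
            return False, parsed     # malformed originally, but repairable
    return False, None               # unrepairable within the attempt budget
```

Under this framing, a benchmark run would count how often a model's raw output parses on the first try, and separately how often initially malformed outputs become parseable after repair, which mirrors the parseability and error-fixing analyses described above.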