BenchING: A Benchmark for Evaluating Large Language Models in Following Structured Output Format Instruction in Text-Based Narrative Game Tasks

Impact Factor 2.8 · CAS Quartile 4 (Computer Science) · JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Pittawat Taveekitworachai;Mury F. Dewantoro;Yi Xia;Pratch Suntichaikul;Ruck Thawonmas
{"title":"benchching:在基于文本的叙事游戏任务中评估遵循结构化输出格式指令的大型语言模型的基准","authors":"Pittawat Taveekitworachai;Mury F. Dewantoro;Yi Xia;Pratch Suntichaikul;Ruck Thawonmas","doi":"10.1109/TG.2025.3529117","DOIUrl":null,"url":null,"abstract":"In this article, we present BenchING, a new benchmark for evaluating large language models (LLMs) on their ability to follow structured output format instructions in text-based procedural content generation (PCG) tasks. The ability to condition LLMs to output in specified formats proves useful, as downstream components in LLM-integrated games often require structured outputs for exchanging information. However, there is a gap in evaluating this aspect of LLMs, especially in narrative PCG tasks, making it difficult to select LLMs and design games or applications integrating these LLMs. To demonstrate the potential of our benchmark, we evaluate nine LLMs for their ability to generate parseable formatted outputs using five selected text-based PCG tasks. We report on the performance of these LLMs on these tasks. In addition, we categorize more detailed error types and propose solutions by utilizing LLMs to fix these errors. We also conduct a scaling study, investigating an emergent point of LLMs for their ability to fix malformed formatted content using eight quantized LLMs with varying original sizes from 0.62 to 72.3 B. Furthermore, we perform a qualitative study to assess the quality of the generated content. We make our source code and raw data available for future research.","PeriodicalId":55977,"journal":{"name":"IEEE Transactions on Games","volume":"17 3","pages":"665-675"},"PeriodicalIF":2.8000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"BenchING: A Benchmark for Evaluating Large Language Models in Following Structured Output Format Instruction in Text-Based Narrative Game Tasks\",\"authors\":\"Pittawat Taveekitworachai;Mury F. Dewantoro;Yi Xia;Pratch Suntichaikul;Ruck Thawonmas\",\"doi\":\"10.1109/TG.2025.3529117\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this article, we present BenchING, a new benchmark for evaluating large language models (LLMs) on their ability to follow structured output format instructions in text-based procedural content generation (PCG) tasks. The ability to condition LLMs to output in specified formats proves useful, as downstream components in LLM-integrated games often require structured outputs for exchanging information. However, there is a gap in evaluating this aspect of LLMs, especially in narrative PCG tasks, making it difficult to select LLMs and design games or applications integrating these LLMs. To demonstrate the potential of our benchmark, we evaluate nine LLMs for their ability to generate parseable formatted outputs using five selected text-based PCG tasks. We report on the performance of these LLMs on these tasks. In addition, we categorize more detailed error types and propose solutions by utilizing LLMs to fix these errors. We also conduct a scaling study, investigating an emergent point of LLMs for their ability to fix malformed formatted content using eight quantized LLMs with varying original sizes from 0.62 to 72.3 B. Furthermore, we perform a qualitative study to assess the quality of the generated content. 
We make our source code and raw data available for future research.\",\"PeriodicalId\":55977,\"journal\":{\"name\":\"IEEE Transactions on Games\",\"volume\":\"17 3\",\"pages\":\"665-675\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2025-01-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Games\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10840256/\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Games","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10840256/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

In this article, we present BenchING, a new benchmark for evaluating large language models (LLMs) on their ability to follow structured output format instructions in text-based procedural content generation (PCG) tasks. The ability to condition LLMs to output in specified formats proves useful, as downstream components in LLM-integrated games often require structured outputs for exchanging information. However, there is a gap in evaluating this aspect of LLMs, especially in narrative PCG tasks, making it difficult to select LLMs and design games or applications integrating these LLMs. To demonstrate the potential of our benchmark, we evaluate nine LLMs for their ability to generate parseable formatted outputs using five selected text-based PCG tasks. We report on the performance of these LLMs on these tasks. In addition, we categorize more detailed error types and propose solutions by utilizing LLMs to fix these errors. We also conduct a scaling study, investigating an emergent point of LLMs for their ability to fix malformed formatted content using eight quantized LLMs with varying original sizes from 0.62 to 72.3 B. Furthermore, we perform a qualitative study to assess the quality of the generated content. We make our source code and raw data available for future research.
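For illustration only, and not taken from the paper: a minimal Python sketch of the kind of format-compliance check and LLM-based repair step the abstract describes, assuming JSON as the target output format and a hypothetical call_llm helper standing in for whatever model client is used.

import json

def call_llm(prompt: str) -> str:
    """Hypothetical stub: send a prompt to an LLM and return its raw text reply.

    Replace with an actual model client in a real setup.
    """
    raise NotImplementedError

def is_parseable(output: str) -> bool:
    """Return True if the model output parses as JSON (the assumed target format)."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def repair_with_llm(malformed: str) -> str:
    """Ask an LLM to rewrite a malformed output so that it becomes valid JSON."""
    prompt = (
        "The following text was supposed to be valid JSON but is malformed. "
        "Return only the corrected JSON, with no extra commentary:\n\n" + malformed
    )
    return call_llm(prompt)

def format_following_rate(outputs: list[str]) -> float:
    """Fraction of task outputs that parse without any repair step."""
    if not outputs:
        return 0.0
    return sum(is_parseable(o) for o in outputs) / len(outputs)

Under these assumptions, format_following_rate would approximate a parseability score per task, and repair_with_llm sketches the error-fixing use of LLMs examined in the scaling study.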
Source journal
IEEE Transactions on Games (Engineering - Electrical and Electronic Engineering)
CiteScore: 4.60
Self-citation rate: 8.70%
Annual publications: 87