BenchING: A Benchmark for Evaluating Large Language Models in Following Structured Output Format Instruction in Text-Based Narrative Game Tasks

Pittawat Taveekitworachai; Mury F. Dewantoro; Yi Xia; Pratch Suntichaikul; Ruck Thawonmas

IEEE Transactions on Games, vol. 17, no. 3, pp. 665-675. Published 14 January 2025. DOI: 10.1109/TG.2025.3529117. Available at: https://ieeexplore.ieee.org/document/10840256/
In this article, we present BenchING, a new benchmark for evaluating large language models (LLMs) on their ability to follow structured output format instructions in text-based procedural content generation (PCG) tasks. The ability to condition LLMs to output in specified formats is useful because downstream components in LLM-integrated games often require structured outputs for exchanging information. However, there is a gap in evaluating this aspect of LLMs, especially in narrative PCG tasks, which makes it difficult to select LLMs and to design games or applications that integrate them. To demonstrate the potential of our benchmark, we evaluate nine LLMs on their ability to generate parseable formatted outputs across five selected text-based PCG tasks and report their performance. In addition, we categorize the errors into more detailed types and propose solutions that utilize LLMs to fix them. We also conduct a scaling study investigating an emergence point in LLMs' ability to fix malformed formatted content, using eight quantized LLMs with original sizes ranging from 0.62 B to 72.3 B parameters. Furthermore, we perform a qualitative study to assess the quality of the generated content. We make our source code and raw data available for future research.
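To make the evaluation setting concrete, the sketch below illustrates the kind of parseability check and LLM-based repair loop the abstract describes. It is not BenchING's actual implementation: it assumes JSON as the target structured format and treats the repair model as an injected callable (`repair_fn`, a hypothetical interface standing in for a prompt to an LLM asked to fix malformed output).

```python
import json
from typing import Callable, Optional

def check_parseable(output: str) -> Optional[dict]:
    """Return the parsed object if `output` is valid JSON, else None."""
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        return None

def evaluate_with_repair(
    output: str,
    repair_fn: Callable[[str], str],
    max_attempts: int = 1,
) -> tuple[bool, Optional[dict]]:
    """Check parseability; on failure, ask a repair model to fix the output.

    `repair_fn` is a placeholder for an LLM call prompted to correct
    malformed formatted content (hypothetical interface, not the paper's API).
    """
    parsed = check_parseable(output)
    if parsed is not None:
        return True, parsed          # well-formed on the first try
    for _ in range(max_attempts):
        output = repair_fn(output)   # LLM-based repair attempt
        parsed = check_parseable(output)
        if parsed is not None:
            return False, parsed     # malformed originally, but repairable
    return False, None               # unrepairable within the attempt budget
```

Under this framing, a benchmark run would count how often a model's raw output parses on the first try, and separately how often initially malformed outputs become parseable after repair, which mirrors the parseability and error-fixing analyses described above.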