Qin Ni, Yangze Yu, Yiming Ma, Xin Lin, Ciping Deng, Tingjiang Wei, Mo Xuan
{"title":"The Social Cognition Ability Evaluation of LLMs: A Dynamic Gamified Assessment and Hierarchical Social Learning Measurement Approach","authors":"Qin Ni, Yangze Yu, Yiming Ma, Xin Lin, Ciping Deng, Tingjiang Wei, Mo Xuan","doi":"10.1145/3673238","DOIUrl":null,"url":null,"abstract":"<p>Large Language Model(LLM) has shown amazing abilities in reasoning tasks, theory of mind(ToM) has been tested in many studies as part of reasoning tasks, and social learning, which is closely related to theory of mind, are still lack of investigation. However, the test methods and materials make the test results unconvincing. We propose a dynamic gamified assessment(DGA) and hierarchical social learning measurement to test ToM and social learning capacities in LLMs. The test for ToM consists of five parts. First, we extract ToM tasks from ToM experiments and then design game rules to satisfy the ToM task requirement. After that, we design ToM questions to match the game’s rules and use these to generate test materials. Finally, we go through the above steps to test the model. To assess the social learning ability, we introduce a novel set of social rules (three in total). Experiment results demonstrate that, except GPT-4, LLMs performed poorly on the ToM test but showed a certain level of social learning ability in social learning measurement.</p>","PeriodicalId":48967,"journal":{"name":"ACM Transactions on Intelligent Systems and Technology","volume":"90 1","pages":""},"PeriodicalIF":7.2000,"publicationDate":"2024-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Intelligent Systems and Technology","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3673238","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Large Language Model(LLM) has shown amazing abilities in reasoning tasks, theory of mind(ToM) has been tested in many studies as part of reasoning tasks, and social learning, which is closely related to theory of mind, are still lack of investigation. However, the test methods and materials make the test results unconvincing. We propose a dynamic gamified assessment(DGA) and hierarchical social learning measurement to test ToM and social learning capacities in LLMs. The test for ToM consists of five parts. First, we extract ToM tasks from ToM experiments and then design game rules to satisfy the ToM task requirement. After that, we design ToM questions to match the game’s rules and use these to generate test materials. Finally, we go through the above steps to test the model. To assess the social learning ability, we introduce a novel set of social rules (three in total). Experiment results demonstrate that, except GPT-4, LLMs performed poorly on the ToM test but showed a certain level of social learning ability in social learning measurement.
大语言模型(LLM)在推理任务中表现出了惊人的能力,心智理论(ToM)作为推理任务的一部分已在许多研究中进行了测试,而与心智理论密切相关的社会学习仍缺乏研究。然而,测试方法和材料使得测试结果缺乏说服力。我们提出了一种动态游戏化测评(DGA)和分层社会学习测评的方法来测试低年级学生的心智理论和社会学习能力。ToM 测试包括五个部分。首先,我们从 ToM 实验中提取 ToM 任务,然后设计游戏规则以满足 ToM 任务要求。然后,我们设计与游戏规则相匹配的 ToM 问题,并利用这些问题生成测试材料。最后,我们通过上述步骤对模型进行测试。为了评估社交学习能力,我们引入了一套新的社交规则(共三套)。实验结果表明,除 GPT-4 外,LLM 在 ToM 测试中表现较差,但在社会学习测量中表现出一定的社会学习能力。
期刊介绍:
ACM Transactions on Intelligent Systems and Technology is a scholarly journal that publishes the highest quality papers on intelligent systems, applicable algorithms and technology with a multi-disciplinary perspective. An intelligent system is one that uses artificial intelligence (AI) techniques to offer important services (e.g., as a component of a larger system) to allow integrated systems to perceive, reason, learn, and act intelligently in the real world.
ACM TIST is published quarterly (six issues a year). Each issue has 8-11 regular papers, with around 20 published journal pages or 10,000 words per paper. Additional references, proofs, graphs or detailed experiment results can be submitted as a separate appendix, while excessively lengthy papers will be rejected automatically. Authors can include online-only appendices for additional content of their published papers and are encouraged to share their code and/or data with other readers.