Jun Gao, Junlin Cui, Huijia Wu, Liuyu Xiang, Han Zhao, Xiangang Li, Meng Fang, Yaodong Yang, Zhaofeng He
{"title":"Can large language models independently complete tasks? A dynamic evaluation framework for multi-turn task planning and completion","authors":"Jun Gao , Junlin Cui , Huijia Wu , Liuyu Xiang , Han Zhao , Xiangang Li , Meng Fang , Yaodong Yang , Zhaofeng He","doi":"10.1016/j.neucom.2025.130135","DOIUrl":null,"url":null,"abstract":"<div><div>Large language models (LLMs) are increasingly relied upon for multi-turn dialogue to conduct complex tasks. However, existing benchmarks mainly evaluate LLMs as agents, overlooking their potential as independent systems to accomplish complex tasks. In addition, these benchmarks typically evaluate the planning and completion capabilities of the models individually, rather than simultaneously. To address these issues, we propose a new <strong>Dynamic Evaluation Framework for Multi-Turn task planning and completion (DEF-MT)</strong> to assess the ability of LLM to independently complete complex tasks in multi-turn scenarios. Our approach quantifies the model’s planning capability by guiding it to generate planning and responses sequentially. Simultaneously, we use a dynamic approach to generate data that simulates the complex intents of real users. Finally, experiments conducted on 9 mainstream models using the Multiwoz 2.2 dataset, indicate that the existing models’ sub-task planning capabilities hinder their ability to complete complex tasks, providing a meaningful reference for the future optimization direction of LLM.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"638 ","pages":"Article 130135"},"PeriodicalIF":5.5000,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225008070","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Can large language models independently complete tasks? A dynamic evaluation framework for multi-turn task planning and completion
Large language models (LLMs) are increasingly relied upon in multi-turn dialogue to carry out complex tasks. However, existing benchmarks mainly evaluate LLMs as agents, overlooking their potential as independent systems for accomplishing complex tasks. In addition, these benchmarks typically evaluate the models' planning and completion capabilities individually rather than simultaneously. To address these issues, we propose a new Dynamic Evaluation Framework for Multi-Turn task planning and completion (DEF-MT) to assess the ability of LLMs to independently complete complex tasks in multi-turn scenarios. Our approach quantifies a model's planning capability by guiding it to generate plans and responses sequentially. In parallel, we use a dynamic approach to generate data that simulates the complex intents of real users. Finally, experiments conducted on 9 mainstream models using the MultiWOZ 2.2 dataset indicate that existing models' sub-task planning capabilities hinder their ability to complete complex tasks, providing a meaningful reference for the future optimization of LLMs.
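To make the "plan first, then respond" evaluation idea in the abstract concrete, the sketch below shows one way a single dialogue turn could be scored by eliciting an explicit sub-task plan before the response and then measuring plan quality against a reference. This is a minimal illustration under our own assumptions, not the paper's actual DEF-MT protocol: the `generate` callable, the prompt wording, the plan format, and the recall-style score are all hypothetical stand-ins.

```python
# Minimal sketch of a sequential plan-then-respond evaluation step.
# Everything here is illustrative: `generate` stands in for any LLM call,
# and the prompts, plan format, and scoring are assumptions, not DEF-MT itself.
from typing import Callable, List


def evaluate_turn(
    generate: Callable[[str], str],   # hypothetical LLM interface: prompt -> completion
    dialogue_history: str,            # concatenated multi-turn context so far
    reference_plan: List[str],        # gold sub-tasks for this turn (e.g., from dataset annotations)
) -> dict:
    # Step 1: elicit an explicit sub-task plan before any response is produced.
    plan_prompt = (
        "Given the dialogue so far, list the sub-tasks you must complete next, "
        "one per line.\n\n" + dialogue_history
    )
    predicted_plan = [
        line.strip() for line in generate(plan_prompt).splitlines() if line.strip()
    ]

    # Step 2: condition the response on the model's own plan.
    response_prompt = (
        dialogue_history
        + "\n\nPlanned sub-tasks:\n" + "\n".join(predicted_plan)
        + "\n\nNow write the next system response."
    )
    response = generate(response_prompt)

    # Step 3: score planning separately from completion (simple set overlap here).
    hits = sum(1 for step in predicted_plan if step in reference_plan)
    plan_recall = hits / len(reference_plan) if reference_plan else 1.0
    return {"plan": predicted_plan, "response": response, "plan_recall": plan_recall}
```

Separating the plan score from the response itself is what lets planning and completion be assessed in the same pass, which is the property the abstract argues existing benchmarks lack.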
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. The journal covers neurocomputing theory, practice, and applications.