{"title":"Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots","authors":"Hongming Zhang, Xiaoman Pan, Hongwei Wang, Kaixin Ma, Wenhao Yu, Dong Yu","doi":"arxiv-2409.10277","DOIUrl":"https://doi.org/arxiv-2409.10277","url":null,"abstract":"We introduce Cognitive Kernel, an open-source agent system towards the goal of generalist autopilots. Unlike copilot systems, which primarily rely on users to provide essential state information (e.g., task descriptions) and assist users by answering questions or auto-completing contents, autopilot systems must complete tasks from start to finish independently, which requires the system to acquire the state information from the environments actively. To achieve this, an autopilot system should be capable of understanding user intents, actively gathering necessary information from various real-world sources, and making wise decisions. Cognitive Kernel adopts a model-centric design. In our implementation, the central policy model (a fine-tuned LLM) initiates interactions with the environment using a combination of atomic actions, such as opening files, clicking buttons, saving intermediate results to memory, or calling the LLM itself. This differs from the widely used environment-centric design, where a task-specific environment with predefined actions is fixed, and the policy model is limited to selecting the correct action from a given set of options. Our design facilitates seamless information flow across various sources and provides greater flexibility. We evaluate our system in three use cases: real-time information management, private information management, and long-term memory management. The results demonstrate that Cognitive Kernel achieves better or comparable performance to other closed-source systems in these scenarios. Cognitive Kernel is fully dockerized, ensuring everyone can deploy it privately and securely. We open-source the system and the backbone model to encourage further research on LLM-driven autopilot systems.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Control With Human-Like Reasoning: Exploring Language Model Embodied Air Traffic Agents","authors":"Justas Andriuškevičius, Junzi Sun","doi":"arxiv-2409.09717","DOIUrl":"https://doi.org/arxiv-2409.09717","url":null,"abstract":"Recent developments in language models have created new opportunities in air traffic control studies. The current focus is primarily on text and language-based use cases. However, these language models may offer a higher potential impact in the air traffic control domain, thanks to their ability to interact with air traffic environments in an embodied agent form. They also provide a language-like reasoning capability to explain their decisions, which has been a significant roadblock for the implementation of automatic air traffic control. This paper investigates the application of a language model-based agent with function-calling and learning capabilities to resolve air traffic conflicts without human intervention. The main components of this research are foundational large language models, tools that allow the agent to interact with the simulator, and a new concept, the experience library. An innovative part of this research, the experience library, is a vector database that stores synthesized knowledge that agents have learned from interactions with the simulations and language models. To evaluate the performance of our language model-based agent, both open-source and closed-source models were tested. The results of our study reveal significant differences in performance across various configurations of the language model-based agents. The best-performing configuration was able to solve all but one of the 120 imminent conflict scenarios, including scenarios with up to four aircraft at the same time. Most importantly, the agents are able to provide human-level text explanations on traffic situations and conflict resolution strategies.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison","authors":"Judy Hanwen Shen, Archit Sharma, Jun Qin","doi":"arxiv-2409.09603","DOIUrl":"https://doi.org/arxiv-2409.09603","url":null,"abstract":"The goal of aligning language models to human preferences requires data that reveal these preferences. Ideally, time and money can be spent carefully collecting and tailoring bespoke preference data to each downstream application. However, in practice, a select few publicly available preference datasets are often used to train reward models for reinforcement learning from human feedback (RLHF). While new preference datasets are being introduced with increasing frequency, there are currently no existing efforts to measure and compare these datasets. In this paper, we systematically study preference datasets through three perspectives: scale, label noise, and information content. We propose specific metrics for each of these perspectives and uncover different axes of comparison for a better understanding of preference datasets. Our work is a first step towards a data-centric approach to alignment by providing perspectives that aid in training efficiency and iterative data collection for RLHF.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models","authors":"Yuanzhao Zhai, Tingkai Yang, Kele Xu, Feng Dawei, Cheng Yang, Bo Ding, Huaimin Wang","doi":"arxiv-2409.09345","DOIUrl":"https://doi.org/arxiv-2409.09345","url":null,"abstract":"Agents significantly enhance the capabilities of standalone Large Language Models (LLMs) by perceiving environments, making decisions, and executing actions. However, LLM agents still face challenges in tasks that require multiple decision-making steps. Estimating the value of actions in specific tasks is difficult when intermediate actions are neither appropriately rewarded nor penalized. In this paper, we propose leveraging a task-relevant Q-value model to guide action selection. Specifically, we first collect decision-making trajectories annotated with step-level Q values via Monte Carlo Tree Search (MCTS) and construct preference data. We then use another LLM to fit these preferences through step-level Direct Preference Optimization (DPO), which serves as the Q-value model. During inference, at each decision-making step, LLM agents select the action with the highest Q value before interacting with the environment. We apply our method to various open-source and API-based LLM agents, demonstrating that Q-value models significantly improve their performance. Notably, the performance of the agent built with Phi-3-mini-4k-instruct improved by 103% on WebShop and 75% on HotPotQA when enhanced with Q-value models, even surpassing GPT-4o-mini. Additionally, Q-value models offer several advantages, such as generalization to different LLM agents and seamless integration with existing prompting strategies.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autonomous Goal Detection and Cessation in Reinforcement Learning: A Case Study on Source Term Estimation","authors":"Yiwei Shi, Muning Wen, Qi Zhang, Weinan Zhang, Cunjia Liu, Weiru Liu","doi":"arxiv-2409.09541","DOIUrl":"https://doi.org/arxiv-2409.09541","url":null,"abstract":"Reinforcement Learning has revolutionized decision-making processes in dynamic environments, yet it often struggles with autonomously detecting and achieving goals without clear feedback signals. For example, in a Source Term Estimation problem, the lack of precise environmental information makes it challenging to provide clear feedback signals and to define and evaluate how the source's location is determined. To address this challenge, the Autonomous Goal Detection and Cessation (AGDC) module was developed, enhancing various RL algorithms by incorporating a self-feedback mechanism for autonomous goal detection and cessation upon task completion. Our method effectively identifies and ceases undefined goals by approximating the agent's belief, significantly enhancing the capabilities of RL algorithms in environments with limited feedback. To validate the effectiveness of our approach, we integrated AGDC with deep Q-Network, proximal policy optimization, and deep deterministic policy gradient algorithms, and evaluated its performance on the Source Term Estimation problem. The experimental results showed that AGDC-enhanced RL algorithms significantly outperformed traditional statistical methods such as infotaxis, entrotaxis, and dual control for exploitation and exploration, as well as a non-statistical random action selection method. These improvements were evident in terms of success rate, mean traveled distance, and search time, highlighting AGDC's effectiveness and efficiency in complex, real-world scenarios.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142268784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Developing an Algorithm Selector for Green Configuration in Scheduling Problems","authors":"Carlos March, Christian Perez, Miguel A. Salido","doi":"arxiv-2409.08641","DOIUrl":"https://doi.org/arxiv-2409.08641","url":null,"abstract":"The Job Shop Scheduling Problem (JSP) is central to operations research, primarily optimizing energy efficiency due to its profound environmental and economic implications. Efficient scheduling enhances production metrics and mitigates energy consumption, thus effectively balancing productivity and sustainability objectives. Given the intricate and diverse nature of JSP instances, along with the array of algorithms developed to tackle these challenges, an intelligent algorithm selection tool becomes paramount. This paper introduces a framework designed to identify key problem features that characterize its complexity and guide the selection of suitable algorithms. Leveraging machine learning techniques, particularly XGBoost, the framework recommends optimal solvers such as GUROBI, CPLEX, and GECODE for efficient JSP scheduling. GUROBI excels with smaller instances, while GECODE demonstrates robust scalability for complex scenarios. The proposed algorithm selector achieves an accuracy of 84.51% in recommending the best algorithm for solving new JSP instances, highlighting its efficacy in algorithm selection. By refining feature extraction methodologies, the framework aims to broaden its applicability across diverse JSP scenarios, thereby advancing efficiency and sustainability in manufacturing logistics.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CPL: Critical Planning Step Learning Boosts LLM Generalization in Reasoning Tasks","authors":"Tianlong Wang, Xueting Han, Jing Bai","doi":"arxiv-2409.08642","DOIUrl":"https://doi.org/arxiv-2409.08642","url":null,"abstract":"Post-training large language models (LLMs) to develop reasoning capabilities has proven effective across diverse domains, such as mathematical reasoning and code generation. However, existing methods primarily focus on improving task-specific reasoning but have not adequately addressed the model's generalization capabilities across a broader range of reasoning tasks. To tackle this challenge, we introduce Critical Planning Step Learning (CPL), which leverages Monte Carlo Tree Search (MCTS) to explore diverse planning steps in multi-step reasoning tasks. Based on long-term outcomes, CPL learns step-level planning preferences to improve the model's planning capabilities and, consequently, its general reasoning capabilities. Furthermore, while effective in many scenarios for aligning LLMs, existing preference learning approaches like Direct Preference Optimization (DPO) struggle with complex multi-step reasoning tasks due to their inability to capture fine-grained supervision at each step. We propose Step-level Advantage Preference Optimization (Step-APO), which integrates an advantage estimate for step-level preference pairs obtained via MCTS into the DPO. This enables the model to more effectively learn critical intermediate planning steps, thereby further improving its generalization in reasoning tasks. Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also enhances out-of-domain reasoning benchmarks, such as ARC-C (+4.0%), BBH (+1.8%), MMLU-STEM (+2.2%), and MMLU (+0.9%).","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A RAG Approach for Generating Competency Questions in Ontology Engineering","authors":"Xueli Pan, Jacco van Ossenbruggen, Victor de Boer, Zhisheng Huang","doi":"arxiv-2409.08820","DOIUrl":"https://doi.org/arxiv-2409.08820","url":null,"abstract":"Competency question (CQ) formulation is central to several ontology development and evaluation methodologies. Traditionally, the task of crafting these competency questions heavily relies on the effort of domain experts and knowledge engineers, which is often time-consuming and labor-intensive. With the emergence of Large Language Models (LLMs), there arises the possibility to automate and enhance this process. Unlike other similar works which use existing ontologies or knowledge graphs as input to LLMs, we present a retrieval-augmented generation (RAG) approach that uses LLMs for the automatic generation of CQs given a set of scientific papers considered to be a domain knowledge base. We investigate its performance and, specifically, study the impact of the number of papers provided to the RAG and of the temperature setting of the LLM. We conduct experiments using GPT-4 on two domain ontology engineering tasks and compare results against ground-truth CQs constructed by domain experts. Empirical assessments on the results, utilizing evaluation metrics (precision and consistency), reveal that compared to zero-shot prompting, adding relevant domain knowledge to the RAG improves the performance of LLMs on generating CQs for concrete ontology engineering tasks.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proactive and Reactive Constraint Programming for Stochastic Project Scheduling with Maximal Time-Lags","authors":"Kim van den Houten, Léon Planken, Esteban Freydell, David M. J. Tax, Mathijs de Weerdt","doi":"arxiv-2409.09107","DOIUrl":"https://doi.org/arxiv-2409.09107","url":null,"abstract":"This study investigates scheduling strategies for the stochastic resource-constrained project scheduling problem with maximal time lags (SRCPSP/max). Recent advances in Constraint Programming (CP) and Temporal Networks have renewed interest in evaluating the advantages and drawbacks of various proactive and reactive scheduling methods. First, we present a new, CP-based fully proactive method. Second, we show how a reactive approach can be constructed using an online rescheduling procedure. A third contribution is based on partial order schedules and uses Simple Temporal Networks with Uncertainty (STNUs). Our statistical analysis shows that the STNU-based algorithm performs best in terms of solution quality, while also showing good relative offline and online computation time.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142252771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale","authors":"Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui","doi":"arxiv-2409.08264","DOIUrl":"https://doi.org/arxiv-2409.08264","url":null,"abstract":"Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on the order of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena. Webpage: https://microsoft.github.io/WindowsAgentArena Code: https://github.com/microsoft/WindowsAgentArena","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}