Robots reading recipes: large language models as translators between humans and machines
Oliver Wang, Grant Cheng, Luc Caspar, Akira Yokota, Mahdi Khosravy, Olaf Witkowski
Artificial Life and Robotics, vol. 30, no. 3, pp. 407–416 (published 2025-06-13)
DOI: 10.1007/s10015-025-01031-3
https://link.springer.com/article/10.1007/s10015-025-01031-3
Abstract
Large Language Models (LLMs) are machine learning models trained on vast amounts of natural language that have demonstrated novel capabilities in tasks such as text prediction and generation. These capabilities make LLMs remarkably well suited to understanding the semantics of natural language, which in turn enables applications such as planning real-world tasks, writing code for computers, and translating between human languages. Even though LLMs can provide more flexibility in interpreting user requests and have been shown to possess some commonsense knowledge, their capability to translate natural language instructions into code that controls robot actions is only starting to be explored. More specifically, in this paper we are interested in the control of robots tasked with preparing cocktails. Within this context, it is assumed that the LLM has access to a repository of well-formatted recipes: each recipe lists its ingredients and then describes how to prepare and mix the various items. Moreover, a set of low-level modules responsible for robot manipulation and vision-related tasks is also provided to the LLM in the form of an application programming interface (API). Consequently, the main focus of the LLM is on generating a sequence of calls to the API, along with the right parameters, to produce the cocktail requested by users in natural language. Here, we show that it is feasible for LLMs to perform this type of translation over a small number of custom modules, and that certain techniques measurably improve the accuracy and consistency of this task without fine-tuning. In particular, we found that an ensemble-voting strategy, in which multiple trials are run and the most common answer is selected, increases accuracy to a certain extent. In addition, there is moderate support for the use of natural language parsing to adjust the LLM's prompt prior to translation. Lastly, building on previous knowledge, we also provide a set of guidelines for designing prompts that improve the accuracy of the resulting sequence of actions. In general, these results suggest that while LLMs can be used as translators of robot instructions, they are best applied in conjunction with these other strategies. These findings could influence future robotics development, as they provide directions for implementing LLMs more effectively and for broadening the accessibility of robotic control to users without an extensive software background.
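
To make the setup concrete, the following is a minimal sketch of the kind of recipe-to-API translation described in the abstract, written in Python. The module names (grab_bottle, pour, shake, serve), the prompt wording, and the JSON plan format are illustrative assumptions rather than the interface used in the paper, and the actual LLM call is omitted; a hard-coded example plan stands in for the model's output.

```python
from typing import Callable

# Hypothetical low-level robot modules; stand-ins for the paper's actual API.
def grab_bottle(name: str) -> None:
    print(f"[robot] grabbing bottle: {name}")

def pour(name: str, amount_ml: int) -> None:
    print(f"[robot] pouring {amount_ml} ml of {name}")

def shake(seconds: int) -> None:
    print(f"[robot] shaking for {seconds} s")

def serve() -> None:
    print("[robot] serving the drink")

# Registry exposed to the model as the "API": only these calls are allowed.
API: dict[str, Callable] = {
    "grab_bottle": grab_bottle,
    "pour": pour,
    "shake": shake,
    "serve": serve,
}

API_DOC = """\
- grab_bottle(name)
- pour(name, amount_ml)
- shake(seconds)
- serve()"""

def build_prompt(recipe: str) -> str:
    """Compose the prompt: API description first, then the recipe to translate."""
    return (
        "You control a bartending robot through the following functions:\n"
        f"{API_DOC}\n\n"
        "Translate this recipe into a JSON list of calls, "
        'formatted as [{"call": <name>, "args": {...}}, ...]:\n'
        f"{recipe}"
    )

def execute(plan: list[dict]) -> None:
    """Run a plan produced by the model, rejecting calls outside the API."""
    for step in plan:
        fn = API.get(step["call"])
        if fn is None:
            raise ValueError(f"unknown call: {step['call']}")
        fn(**step.get("args", {}))

# Example: the kind of plan an LLM might return for a simple shaken drink.
example_plan = [
    {"call": "grab_bottle", "args": {"name": "gin"}},
    {"call": "pour", "args": {"name": "gin", "amount_ml": 50}},
    {"call": "shake", "args": {"seconds": 10}},
    {"call": "serve"},
]
execute(example_plan)
```

Restricting execution to a whitelisted registry keeps the generated plan confined to the capabilities the robot actually exposes, which is one simple way to keep model errors from producing arbitrary code.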
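The ensemble-voting strategy can be sketched just as briefly. In this hedged example, sample_plan is a placeholder for a single LLM trial (it is not part of the paper's code); plans are serialized to canonical JSON so that identical answers can be counted and the most frequent one returned.

```python
import json
import random
from collections import Counter

def sample_plan(prompt: str) -> list[dict]:
    """Placeholder for one LLM trial; returns one of a few candidate plans."""
    candidates = [
        [{"call": "pour", "args": {"name": "gin", "amount_ml": 50}}],
        [{"call": "pour", "args": {"name": "gin", "amount_ml": 50}}],
        [{"call": "pour", "args": {"name": "vodka", "amount_ml": 50}}],
    ]
    return random.choice(candidates)

def vote(prompt: str, trials: int = 5) -> list[dict]:
    """Repeat the translation `trials` times and keep the most common plan."""
    # Serialize each plan to a canonical JSON string so duplicates can be counted.
    counts = Counter(
        json.dumps(sample_plan(prompt), sort_keys=True) for _ in range(trials)
    )
    winner, _ = counts.most_common(1)[0]
    return json.loads(winner)

print(vote("Make a dry martini", trials=7))
```

Majority voting of this kind trades extra inference cost for consistency: occasional aberrant translations are outvoted as long as the model produces the correct plan more often than any single incorrect one.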