Enabling action crossmodality for a pretrained large language model

Anton Caesar, Ozan Özdemir, Cornelius Weber, Stefan Wermter
{"title":"为预先训练的大型语言模型提供动作跨模态功能","authors":"Anton Caesar,&nbsp;Ozan Özdemir,&nbsp;Cornelius Weber,&nbsp;Stefan Wermter","doi":"10.1016/j.nlp.2024.100072","DOIUrl":null,"url":null,"abstract":"<div><p>Natural language processing and vision tasks have recently seen large improvements through the rise of Transformer architectures. The high-performing large language models (LLMs) benefit from large textual datasets that are numerously available online. However, action and bidirectional action-language tasks are less developed, as these require more specific and labeled data. Therefore, we aim at enabling these robotic action capabilities for a pretrained LLM, while maintaining high efficiency with regards to the required training time and data size. To achieve this, we split up a Transformer-based LLM and insert a multimodal architecture into it. Specifically, we split a pretrained T5 LLM between its encoder and decoder parts, to insert a crossmodal Transformer component of a Paired Transformed Autoencoders (PTAE) bidirectional action-language model. The experiments are conducted on a new dataset, consisting of unimodal language translation and crossmodal bidirectional action-language translation. The natural language capabilities of the original T5 are re-established efficiently by training the crossmodal Transformer, which requires only one 5.7 millionth of the T5 model’s original training data. Furthermore, the new model, called CrossT5, achieves high accuracy for the vision- and language-guided robotic action tasks. By design, the CrossT5 agent acts robustly when tested with language commands not included in the dataset. The results demonstrate that this novel approach is successful in combining the advanced linguistic capabilities of LLMs with the low-level robotic control skills of vision-action models. The code is available at this URL: <span>https://github.com/samsoneko/CrossT5</span><svg><path></path></svg>.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"7 ","pages":"Article 100072"},"PeriodicalIF":0.0000,"publicationDate":"2024-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719124000207/pdfft?md5=cc42b6eb8402b00afc108e973be38c4c&pid=1-s2.0-S2949719124000207-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Enabling action crossmodality for a pretrained large language model\",\"authors\":\"Anton Caesar,&nbsp;Ozan Özdemir,&nbsp;Cornelius Weber,&nbsp;Stefan Wermter\",\"doi\":\"10.1016/j.nlp.2024.100072\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Natural language processing and vision tasks have recently seen large improvements through the rise of Transformer architectures. The high-performing large language models (LLMs) benefit from large textual datasets that are numerously available online. However, action and bidirectional action-language tasks are less developed, as these require more specific and labeled data. Therefore, we aim at enabling these robotic action capabilities for a pretrained LLM, while maintaining high efficiency with regards to the required training time and data size. To achieve this, we split up a Transformer-based LLM and insert a multimodal architecture into it. Specifically, we split a pretrained T5 LLM between its encoder and decoder parts, to insert a crossmodal Transformer component of a Paired Transformed Autoencoders (PTAE) bidirectional action-language model. 
The experiments are conducted on a new dataset, consisting of unimodal language translation and crossmodal bidirectional action-language translation. The natural language capabilities of the original T5 are re-established efficiently by training the crossmodal Transformer, which requires only one 5.7 millionth of the T5 model’s original training data. Furthermore, the new model, called CrossT5, achieves high accuracy for the vision- and language-guided robotic action tasks. By design, the CrossT5 agent acts robustly when tested with language commands not included in the dataset. The results demonstrate that this novel approach is successful in combining the advanced linguistic capabilities of LLMs with the low-level robotic control skills of vision-action models. The code is available at this URL: <span>https://github.com/samsoneko/CrossT5</span><svg><path></path></svg>.</p></div>\",\"PeriodicalId\":100944,\"journal\":{\"name\":\"Natural Language Processing Journal\",\"volume\":\"7 \",\"pages\":\"Article 100072\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2949719124000207/pdfft?md5=cc42b6eb8402b00afc108e973be38c4c&pid=1-s2.0-S2949719124000207-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Natural Language Processing Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2949719124000207\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719124000207","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract


Natural language processing and vision tasks have recently seen large improvements through the rise of Transformer architectures. High-performing large language models (LLMs) benefit from the large textual datasets that are abundantly available online. However, action and bidirectional action-language tasks are less developed, as they require more specific, labeled data. We therefore aim to enable these robotic action capabilities for a pretrained LLM while maintaining high efficiency with regard to the required training time and data size. To achieve this, we split up a Transformer-based LLM and insert a multimodal architecture into it. Specifically, we split a pretrained T5 LLM between its encoder and decoder parts and insert the crossmodal Transformer component of a Paired Transformed Autoencoders (PTAE) bidirectional action-language model. The experiments are conducted on a new dataset consisting of unimodal language translation and crossmodal bidirectional action-language translation. The natural language capabilities of the original T5 are re-established efficiently by training the crossmodal Transformer, which requires only one 5.7-millionth of the T5 model's original training data. Furthermore, the new model, called CrossT5, achieves high accuracy on the vision- and language-guided robotic action tasks. By design, the CrossT5 agent acts robustly when tested with language commands not included in the dataset. The results demonstrate that this novel approach successfully combines the advanced linguistic capabilities of LLMs with the low-level robotic control skills of vision-action models. The code is available at https://github.com/samsoneko/CrossT5.
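To make the encoder-split-decoder idea concrete, the following is a minimal PyTorch sketch of how a trainable crossmodal fusion module could be inserted between the encoder and decoder of a frozen, pretrained T5. This is not the authors' implementation (see the linked repository for that): the `CrossmodalBlock` class, the vision and action feature dimensions, and the projection layers are illustrative assumptions, and the PTAE component used in the paper differs in detail.

```python
# Minimal sketch (not the authors' code; see the repository above) of the
# CrossT5 idea: split a pretrained T5 between its encoder and decoder and
# insert a trainable crossmodal Transformer that fuses language, vision,
# and action features. CrossmodalBlock and all dimensions are illustrative.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.modeling_outputs import BaseModelOutput

class CrossmodalBlock(nn.Module):
    """Hypothetical PTAE-style fusion module placed inside the split T5."""
    def __init__(self, d_model, n_heads=8, n_layers=2,
                 vision_dim=2048, action_dim=5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        self.vision_proj = nn.Linear(vision_dim, d_model)  # e.g. CNN features
        self.action_proj = nn.Linear(action_dim, d_model)  # e.g. joint angles

    def forward(self, text_hidden, vision_feats, action_feats):
        # Fuse all modalities as one token sequence, then return a
        # text-length prefix so the T5 decoder sees its usual input shape.
        tokens = torch.cat([text_hidden,
                            self.vision_proj(vision_feats),
                            self.action_proj(action_feats)], dim=1)
        fused = self.fusion(tokens)
        return fused[:, : text_hidden.size(1), :]

tokenizer = T5Tokenizer.from_pretrained("t5-small")
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
crossmodal = CrossmodalBlock(d_model=t5.config.d_model)

# The pretrained T5 stays frozen; only the inserted block would be trained.
for p in t5.parameters():
    p.requires_grad = False

inputs = tokenizer("push the red cube to the left", return_tensors="pt")
text_hidden = t5.encoder(**inputs).last_hidden_state  # (1, seq, d_model)
vision_feats = torch.randn(1, 4, 2048)                # placeholder vision input
action_feats = torch.randn(1, 4, 5)                   # placeholder joint states
fused = crossmodal(text_hidden, vision_feats, action_feats)

# Hand the fused representation to the T5 decoder in place of the plain
# encoder output and generate a language response.
output_ids = t5.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=fused),
    max_new_tokens=20,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Note that this sketch only wires up the language path; in the paper, the inserted crossmodal Transformer is bidirectional and also produces action outputs for robot control, which is what allows one model to translate between language and action in both directions.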
