Benchmarking Large Language Model Capabilities for Conditional Generation

Annual Meeting of the Association for Computational Linguistics | Pub Date: 2023-06-29 | DOI: 10.48550/arXiv.2306.16793

Joshua Maynez, Priyanka Agrawal, Sebastian Gehrmann


Citations: 3

Abstract

Pre-trained large language models (PLMs) underlie most new developments in natural language processing. They have shifted the field from application-specific model pipelines to a single model that is adapted to a wide range of tasks. Autoregressive PLMs like GPT-3 or PaLM and associated techniques like few-shot learning have additionally shifted the output modality to generation instead of classification or regression. Despite their ubiquitous use, the generation quality of language models is rarely evaluated when these models are introduced. Additionally, it is unclear how existing generation tasks, while they can be used to compare systems at a high level, relate to the real-world use cases for which people have been adopting them. In this work, we discuss how to adapt existing application-specific generation benchmarks to PLMs and provide an in-depth, empirical study of the limitations and capabilities of PLMs in natural language generation tasks along dimensions such as scale, architecture, and input and output language. Our results show that PLMs differ in their applicability to different data regimes and their generalization to multiple languages. The results further inform practitioners as to which PLMs to use for a given generation task setup. We share best practices to be taken into consideration when benchmarking generation capabilities during the development of upcoming PLMs.

