Integrating Multimodal Information in Large Pretrained Transformers.

Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, Amir Zadeh, Chengfeng Mao, Louis-Philippe Morency, Ehsan Hoque
{"title":"大型预训练变压器的多模态信息集成。","authors":"Wasifur Rahman,&nbsp;Md Kamrul Hasan,&nbsp;Sangwu Lee,&nbsp;Amir Zadeh,&nbsp;Chengfeng Mao,&nbsp;Louis-Philippe Morency,&nbsp;Ehsan Hoque","doi":"10.18653/v1/2020.acl-main.214","DOIUrl":null,"url":null,"abstract":"<p><p>Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straight-forward for lexical applications (applications with only language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication). Pre-trained models don't have the necessary components to accept two extra modalities of vision and acoustic. In this paper, we proposed an attachment to BERT and XLNet called Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift to internal representation of BERT and XLNet; a shift that is conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Fine-tuning MAG-BERT and MAG-XLNet significantly boosts the sentiment analysis performance over previous baselines as well as language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.</p>","PeriodicalId":74541,"journal":{"name":"Proceedings of the conference. Association for Computational Linguistics. Meeting","volume":" ","pages":"2359-2369"},"PeriodicalIF":0.0000,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8005298/pdf/nihms-1680563.pdf","citationCount":"229","resultStr":"{\"title\":\"Integrating Multimodal Information in Large Pretrained Transformers.\",\"authors\":\"Wasifur Rahman,&nbsp;Md Kamrul Hasan,&nbsp;Sangwu Lee,&nbsp;Amir Zadeh,&nbsp;Chengfeng Mao,&nbsp;Louis-Philippe Morency,&nbsp;Ehsan Hoque\",\"doi\":\"10.18653/v1/2020.acl-main.214\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straight-forward for lexical applications (applications with only language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication). Pre-trained models don't have the necessary components to accept two extra modalities of vision and acoustic. In this paper, we proposed an attachment to BERT and XLNet called Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift to internal representation of BERT and XLNet; a shift that is conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. 
Fine-tuning MAG-BERT and MAG-XLNet significantly boosts the sentiment analysis performance over previous baselines as well as language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.</p>\",\"PeriodicalId\":74541,\"journal\":{\"name\":\"Proceedings of the conference. Association for Computational Linguistics. Meeting\",\"volume\":\" \",\"pages\":\"2359-2369\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8005298/pdf/nihms-1680563.pdf\",\"citationCount\":\"229\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the conference. Association for Computational Linguistics. Meeting\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2020.acl-main.214\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the conference. Association for Computational Linguistics. Meeting","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2020.acl-main.214","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 229

Abstract


Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only the language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication). Pre-trained models do not have the necessary components to accept the two extra modalities of vision and acoustics. In this paper, we propose an attachment to BERT and XLNet called the Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift in the internal representations of BERT and XLNet, conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous baselines as well as over language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.
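The abstract describes MAG as a gating unit that computes a displacement to the transformer's internal word representation, conditioned on the visual and acoustic inputs, and adds that displacement back to the lexical vector. Below is a minimal PyTorch sketch of such a gate based on this description; the layer names, dimensions, the `beta` scaling cap, and the LayerNorm/dropout placement are illustrative assumptions, not the authors' exact released implementation.

```python
# Hypothetical sketch of a Multimodal Adaptation Gate (MAG) as described
# in the abstract. Hyperparameters and normalization details are assumed.
import torch
import torch.nn as nn


class MAG(nn.Module):
    def __init__(self, text_dim, visual_dim, acoustic_dim, beta=1.0, dropout=0.5):
        super().__init__()
        # Gates conditioned on the lexical vector concatenated with each modality.
        self.visual_gate = nn.Linear(text_dim + visual_dim, text_dim)
        self.acoustic_gate = nn.Linear(text_dim + acoustic_dim, text_dim)
        # Projections mapping each nonverbal modality into the text space.
        self.visual_proj = nn.Linear(visual_dim, text_dim)
        self.acoustic_proj = nn.Linear(acoustic_dim, text_dim)
        self.beta = beta  # assumed cap on the shift magnitude
        self.layer_norm = nn.LayerNorm(text_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, z, v, a):
        # z: (batch, seq, text_dim) hidden states from BERT/XLNet
        # v: (batch, seq, visual_dim), a: (batch, seq, acoustic_dim)
        g_v = torch.relu(self.visual_gate(torch.cat([z, v], dim=-1)))
        g_a = torch.relu(self.acoustic_gate(torch.cat([z, a], dim=-1)))
        # Nonverbal displacement vector built from both gated modalities.
        h = g_v * self.visual_proj(v) + g_a * self.acoustic_proj(a)
        # Scale the shift so it cannot overwhelm the lexical representation.
        eps = 1e-6
        alpha = torch.clamp(
            self.beta * z.norm(dim=-1, keepdim=True)
            / (h.norm(dim=-1, keepdim=True) + eps),
            max=1.0,
        )
        return self.dropout(self.layer_norm(z + alpha * h))
```

In use, a module like this would be inserted after one of the encoder layers of BERT or XLNet during fine-tuning, so the shifted representation `z + alpha * h` flows through the remaining pre-trained layers unchanged; which layer to attach it to is a design choice the sketch leaves open.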
