Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding

Yi Zhu, Zexun Wang, Hang Liu, Pei-Hsin Wang, Mingchao Feng, Meng Chen, Xiaodong He
{"title":"基于多粒度对齐的跨模态迁移学习用于端到端口语理解","authors":"Yi Zhu, Zexun Wang, Hang Liu, Pei-Hsin Wang, Mingchao Feng, Meng Chen, Xiaodong He","doi":"10.21437/interspeech.2022-11378","DOIUrl":null,"url":null,"abstract":"End-to-end spoken language understanding (E2E-SLU) has witnessed impressive improvements through cross-modal (text-to-audio) transfer learning. However, current methods mostly focus on coarse-grained sequence-level text-to-audio knowledge transfer with simple loss, and neglecting the fine-grained temporal alignment between the two modalities. In this work, we propose a novel multi-grained cross-modal transfer learning framework for E2E-SLU. Specifically, we devise a cross attention module to align the tokens of text with the frame features of speech, encouraging the model to target at the salient acoustic features attended to each token during transferring the semantic information. We also leverage contrastive learning to facilitate cross-modal representation learning in sentence level. Finally, we explore various data augmentation methods to mitigate the deficiency of large amount of labelled data for the training of E2E-SLU. Extensive experiments are conducted on both English and Chinese SLU datasets to verify the effectiveness of our proposed approach. Experimental results and detailed analyses demonstrate the superiority and competitiveness of our model.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1131-1135"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding\",\"authors\":\"Yi Zhu, Zexun Wang, Hang Liu, Pei-Hsin Wang, Mingchao Feng, Meng Chen, Xiaodong He\",\"doi\":\"10.21437/interspeech.2022-11378\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"End-to-end spoken language understanding (E2E-SLU) has witnessed impressive improvements through cross-modal (text-to-audio) transfer learning. However, current methods mostly focus on coarse-grained sequence-level text-to-audio knowledge transfer with simple loss, and neglecting the fine-grained temporal alignment between the two modalities. In this work, we propose a novel multi-grained cross-modal transfer learning framework for E2E-SLU. Specifically, we devise a cross attention module to align the tokens of text with the frame features of speech, encouraging the model to target at the salient acoustic features attended to each token during transferring the semantic information. We also leverage contrastive learning to facilitate cross-modal representation learning in sentence level. Finally, we explore various data augmentation methods to mitigate the deficiency of large amount of labelled data for the training of E2E-SLU. Extensive experiments are conducted on both English and Chinese SLU datasets to verify the effectiveness of our proposed approach. 
Experimental results and detailed analyses demonstrate the superiority and competitiveness of our model.\",\"PeriodicalId\":73500,\"journal\":{\"name\":\"Interspeech\",\"volume\":\"1 1\",\"pages\":\"1131-1135\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Interspeech\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/interspeech.2022-11378\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interspeech","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/interspeech.2022-11378","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

End-to-end spoken language understanding (E2E-SLU) has seen impressive improvements through cross-modal (text-to-audio) transfer learning. However, current methods mostly focus on coarse-grained, sequence-level text-to-audio knowledge transfer with a simple loss, neglecting the fine-grained temporal alignment between the two modalities. In this work, we propose a novel multi-grained cross-modal transfer learning framework for E2E-SLU. Specifically, we devise a cross-attention module that aligns text tokens with speech frame features, encouraging the model to focus on the salient acoustic features attended to by each token while transferring semantic information. We also leverage contrastive learning to facilitate cross-modal representation learning at the sentence level. Finally, we explore various data augmentation methods to mitigate the scarcity of labelled data for training E2E-SLU models. Extensive experiments are conducted on both English and Chinese SLU datasets to verify the effectiveness of the proposed approach. Experimental results and detailed analyses demonstrate the superiority and competitiveness of our model.
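
The token-to-frame cross-attention described in the abstract is not released as code here, but a minimal sketch conveys the idea: text token embeddings act as queries over speech frame features, so each token gathers the acoustic frames most relevant to it. The class name, dimensions, and the per-token transfer loss mentioned in the comments are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; module name, sizes, and wiring are assumptions.
import torch
import torch.nn as nn

class TokenFrameCrossAttention(nn.Module):
    """Align text token embeddings (queries) with speech frame
    features (keys/values) via multi-head cross attention."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, token_emb: torch.Tensor, frame_feat: torch.Tensor):
        # token_emb:  (batch, n_tokens, d_model) -- text-side queries
        # frame_feat: (batch, n_frames, d_model) -- speech-side keys/values
        aligned, attn_weights = self.attn(token_emb, frame_feat, frame_feat)
        # `aligned` is a per-token summary of the acoustic frames each token
        # attends to; a fine-grained transfer loss (e.g. MSE against the
        # token embedding) could then be applied per token rather than only
        # once per sequence.
        return aligned, attn_weights

if __name__ == "__main__":
    xattn = TokenFrameCrossAttention()
    tokens = torch.randn(2, 12, 256)    # 12 text tokens per utterance
    frames = torch.randn(2, 300, 256)   # 300 speech frames per utterance
    aligned, w = xattn(tokens, frames)  # aligned: (2, 12, 256)
```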
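Similarly, the sentence-level contrastive objective can be sketched as a symmetric InfoNCE loss over paired text and speech sentence embeddings, a standard formulation for cross-modal contrastive learning; the function name and temperature value below are assumptions, and the paper may use a different variant.

```python
# Sketch of a symmetric InfoNCE loss; details are assumed, not from the paper.
import torch
import torch.nn.functional as F

def sentence_contrastive_loss(text_vec: torch.Tensor,
                              speech_vec: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Matched (text, speech) pairs in the batch are positives;
    every other pairing serves as a negative."""
    text_vec = F.normalize(text_vec, dim=-1)
    speech_vec = F.normalize(speech_vec, dim=-1)
    logits = text_vec @ speech_vec.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull each text embedding toward its own utterance, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```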
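The abstract only says "various data augmentation methods" without naming them; SpecAugment-style time and frequency masking is one widely used option for speech inputs, shown here purely as a representative (assumed) example rather than the authors' actual augmentation pipeline.

```python
# Assumed example: SpecAugment-style masking; the paper does not specify
# which augmentations it uses.
import torch

def spec_augment(features: torch.Tensor, n_time_masks: int = 2,
                 n_freq_masks: int = 2, max_t: int = 30,
                 max_f: int = 10) -> torch.Tensor:
    """Zero out random time and frequency bands of a
    (n_frames, n_mels) feature matrix."""
    feats = features.clone()
    n_frames, n_mels = feats.shape
    for _ in range(n_time_masks):
        t = int(torch.randint(0, max_t + 1, (1,)))
        t0 = int(torch.randint(0, max(1, n_frames - t), (1,)))
        feats[t0:t0 + t, :] = 0.0
    for _ in range(n_freq_masks):
        f = int(torch.randint(0, max_f + 1, (1,)))
        f0 = int(torch.randint(0, max(1, n_mels - f), (1,)))
        feats[:, f0:f0 + f] = 0.0
    return feats
```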