What Do Audio Transformers Hear? Probing Their Representations For Language Delivery & Structure

2022 IEEE International Conference on Data Mining Workshops (ICDMW) Pub Date : 2022-11-01 DOI:10.1109/ICDMW58026.2022.00120

Yaman Kumar Singla, Jui Shah, Changyou Chen, R. Shah

{"title":"What Do Audio Transformers Hear? Probing Their Representations For Language Delivery & Structure","authors":"Yaman Kumar Singla, Jui Shah, Changyou Chen, R. Shah","doi":"10.1109/ICDMW58026.2022.00120","DOIUrl":null,"url":null,"abstract":"Transformer models across multiple domains such as natural language processing and speech form an unavoidable part of the tech stack of practitioners and researchers alike. Au-dio transformers that exploit representational learning to train on unlabeled speech have recently been used for tasks from speaker verification to discourse-coherence with much success. However, little is known about what these models learn and represent in the high-dimensional latent space. In this paper, we interpret two such recent state-of-the-art models, wav2vec2.0 and Mockingjay, on linguistic and acoustic features. We probe each of their layers to understand what it is learning and at the same time, we draw a distinction between the two models. By comparing their performance across a wide variety of settings including native, non-native, read and spontaneous speeches, we also show how much these models are able to learn transferable features. Our results show that the models are capable of significantly capturing a wide range of characteristics such as audio, fluency, supraseg-mental pronunciation, and even syntactic and semantic text-based characteristics. For each category of characteristics, we identify a learning pattern for each framework and conclude which model and which layer of that model is better for a specific category of feature to choose for feature extraction for downstream tasks.","PeriodicalId":146687,"journal":{"name":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Data Mining Workshops (ICDMW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDMW58026.2022.00120","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

Abstract

Transformer models across multiple domains such as natural language processing and speech form an unavoidable part of the tech stack of practitioners and researchers alike. Au-dio transformers that exploit representational learning to train on unlabeled speech have recently been used for tasks from speaker verification to discourse-coherence with much success. However, little is known about what these models learn and represent in the high-dimensional latent space. In this paper, we interpret two such recent state-of-the-art models, wav2vec2.0 and Mockingjay, on linguistic and acoustic features. We probe each of their layers to understand what it is learning and at the same time, we draw a distinction between the two models. By comparing their performance across a wide variety of settings including native, non-native, read and spontaneous speeches, we also show how much these models are able to learn transferable features. Our results show that the models are capable of significantly capturing a wide range of characteristics such as audio, fluency, supraseg-mental pronunciation, and even syntactic and semantic text-based characteristics. For each category of characteristics, we identify a learning pattern for each framework and conclude which model and which layer of that model is better for a specific category of feature to choose for feature extraction for downstream tasks.

查看原文本刊更多论文

音频变压器听到什么?语言表达与结构表征探析

跨多个领域(如自然语言处理和语音)的转换模型是从业者和研究人员不可避免的技术堆栈的一部分。利用表征学习对未标记语音进行训练的音频转换器最近被用于从说话人验证到话语连贯的任务，并取得了很大的成功。然而，对于这些模型在高维潜在空间中学习和表示什么，人们知之甚少。在本文中，我们解释了两个最新的最先进的模型，wav2vec2.0和Mockingjay，关于语言和声学特征。我们探测它们的每一层，以了解它在学习什么，同时，我们在两个模型之间画出区别。通过比较它们在各种环境下的表现，包括母语、非母语、阅读和自发演讲，我们也展示了这些模型能够学习到多少可转移的特征。我们的研究结果表明，这些模型能够显著地捕获广泛的特征，如音频、流利性、超心理发音，甚至是基于文本的句法和语义特征。对于每个类别的特征，我们为每个框架确定一个学习模式，并得出结论，哪个模型和该模型的哪一层更适合用于下游任务的特定类别的特征提取。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2022 IEEE International Conference on Data Mining Workshops (ICDMW)

自引率

0.00%

发文量