ASiT: Local-Global Audio Spectrogram Vision Transformer for Event Classification

IF 4.1 · Zone 2 (Computer Science) · Q1 Acoustics

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-16 DOI:10.1109/TASLP.2024.3428908

Sara Atito Ali Ahmed;Muhammad Awais;Wenwu Wang;Mark D. Plumbley;Josef Kittler

Volume 32, pages 3684-3693. Full text: https://ieeexplore.ieee.org/document/10599807/

Citations: 0

Abstract

Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data-hungry nature of transformers and the limited amount of labeled data, most transformer-based models for audio tasks are fine-tuned from ImageNet-pretrained models, despite the huge gap between the domain of natural images and audio. This has motivated research into self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose Local-Global Audio Spectrogram vIsion Transformer, namely ASiT, a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts the performance on all tasks and sets a new state-of-the-art performance in five audio and speech classification tasks, outperforming recent methods, including the approaches that use additional datasets for pretraining.




Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing (Acoustics; Engineering, Electrical & Electronic)

CiteScore: 11.30

Self-citation rate: 11.10%

Articles published: 217

Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering, and document indexing and retrieval, as well as general language modeling.