Class-aware Self-Attention for Audio Event Recognition

Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. Pub Date: 2018-06-05. DOI: 10.1145/3206025.3206067

Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann


Citations: 12

Abstract

Audio event recognition (AER) is an important research problem with a wide range of applications. However, developing large-scale audio event recognition models is very challenging. On the one hand, usually only weakly labeled audio training data are available, which contain labels of audio events without temporal boundaries. On the other hand, the distribution of audio events is generally long-tailed, with only a few positive samples for a large number of audio events. These two issues make it hard to learn discriminative acoustic features for recognizing audio events, especially long-tailed events. In this paper, we propose a novel class-aware self-attention mechanism with attention factor sharing to generate discriminative clip-level features for audio event recognition. Since a target audio event occurs in only part of an audio clip and its temporal interval varies, the proposed class-aware self-attention approach learns to highlight relevant temporal intervals while suppressing irrelevant noise. To learn attention patterns effectively for long-tailed events, we combine domain knowledge and data-driven strategies to share attention factors in the proposed attention mechanism, which transfers common knowledge learned from similar events to rare events. The proposed attention mechanism is a pluggable component and can be trained end-to-end within the overall AER model. We evaluate our model on the large-scale audio event corpus "Audio Set" with both short-term and long-term acoustic features. The experimental results demonstrate the effectiveness of our model, which improves overall audio event recognition performance across different acoustic features, especially for low-resource events. Moreover, the experiments show that the proposed model can learn new audio events from a few training examples effectively and efficiently without disturbing previously learned audio events.
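
The abstract does not give the formulation of the attention mechanism, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each event class attends over the frames of a clip with its own query vector, and that each query is a weighted mix of a small set of shared "attention factors", so that rare classes can reuse patterns learned from frequent ones. The function name, the dot-product scoring, the softmax over time, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def class_aware_attention_pool(frames, factors, mixing):
    """Pool frame-level features into one clip-level feature per class.

    frames:  (T, D) frame-level acoustic features of one clip
    factors: (K, D) attention factors shared across classes (assumed form)
    mixing:  (C, K) per-class weights over the shared factors (assumed form)
    Returns clip features of shape (C, D) and attention weights of shape (C, T).
    """
    queries = mixing @ factors       # (C, D): one query per class
    scores = queries @ frames.T      # (C, T): relevance of each frame to each class
    attn = softmax(scores, axis=1)   # each row sums to 1 over time
    clip = attn @ frames             # (C, D): class-specific weighted average
    return clip, attn


# Toy example with random inputs (illustrative sizes only).
rng = np.random.default_rng(0)
T, D, K, C = 10, 8, 3, 5
frames = rng.normal(size=(T, D))
factors = rng.normal(size=(K, D))
mixing = rng.normal(size=(C, K))
clip_features, attention = class_aware_attention_pool(frames, factors, mixing)
```

In this sketch a new class adds only one row to `mixing`, leaving `factors` and the other rows untouched, which is one way the "learn new events without disturbing previously learned ones" property could be realized.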


