Deep semantic learning for acoustic scene classification

IF 1.9 | Zone 3, Computer Science | Q2 ACOUSTICS

Eurasip Journal on Audio Speech and Music Processing | Pub Date: 2024-01-03 | DOI: 10.1186/s13636-023-00323-5

Yun-Fei Shao, Xin-Xin Ma, Yong Ma, Wei-Qiang Zhang

Citations: 0

Abstract

Acoustic scene classification (ASC) is the process of identifying the acoustic environment or scene from which an audio signal is recorded. In this work, we propose an encoder-decoder-based approach to ASC, which is borrowed from the SegNet in image semantic segmentation tasks. We also propose a novel feature normalization method named Mixup Normalization, which combines channel-wise instance normalization and the Mixup method to learn useful information for scene and discard specific information related to different devices. In addition, we propose an event extraction block, which can extract the accurate semantic segmentation region from the segmentation network, to imitate the effect of image segmentation on audio features. With four data augmentation techniques, our best single system achieved an average accuracy of 71.26% on different devices in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 ASC Task 1A dataset. The result indicates a minimum margin of 17% against the DCASE 2020 challenge Task 1A baseline system. It has lower complexity and higher performance compared with other state-of-the-art CNN models, without using any supplementary data other than the official challenge dataset.
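
The abstract names two ingredients of Mixup Normalization, channel-wise instance normalization and Mixup, but does not say how they are combined. The following is only a minimal NumPy sketch of the two standard building blocks applied one after the other, not the authors' method. The tensor layout (batch, channels, freq, time), the function names, and the value alpha=0.4 are assumptions made for illustration.

```python
import numpy as np


def instance_norm(x, eps=1e-5):
    """Standard instance normalization (assumed layout: batch, channels, freq, time).

    Each channel of each example is normalized with its own mean and
    variance, which removes per-recording offsets and gains.
    """
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def mixup(x, y, alpha=0.4, rng=None):
    """Standard Mixup: convex combination of shuffled example pairs.

    alpha is a hypothetical choice; lam is drawn from Beta(alpha, alpha).
    y is expected to be one-hot (or soft) labels of shape (batch, classes).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(x.shape[0])
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix, lam


# Toy data standing in for log-mel features of 4 clips with 10 scene classes.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 2, 8, 16))
y = np.eye(10)[rng.integers(0, 10, size=4)]

x_norm = instance_norm(x)                       # per-example, per-channel statistics
x_mix, y_mix, lam = mixup(x_norm, y, rng=rng)   # mixed inputs and soft labels
```

After normalization every (example, channel) slice has approximately zero mean and unit variance, and each mixed label row still sums to 1, so it can be used directly as a soft target.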

Source journal

Eurasip Journal on Audio Speech and Music Processing
Categories: ACOUSTICS; ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore: 4.10

Self-citation rate: 4.20%

Publication volume: not listed

Review time: 12 months

About the journal: The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists, and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.