Sara Barahona; Diego de Benito-Gorrón; Doroteo T. Toledano; Daniel Ramos

"Enhancing Conformer-Based Sound Event Detection Using Frequency Dynamic Convolutions and BEATs Audio Embeddings"

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3896-3907, published 2024-08-15.
DOI: 10.1109/TASLP.2024.3444490
URL: https://ieeexplore.ieee.org/document/10637738/
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10637738
Citations: 0
Abstract
Over the last few years, most audio processing tasks employing deep learning techniques have achieved state-of-the-art results with Conformer-based systems. However, in sound event detection (SED), the Conformer architecture has scarcely been used since it won Task 4 of the DCASE Challenge 2020. In previous research, we found that Conformer-based systems achieved higher performance in sound event classification than other frequently employed architectures, such as Convolutional Recurrent Neural Networks (CRNNs). Given that the second scenario proposed for the Polyphonic Sound Detection Score (PSDS2) focuses on avoiding confusion between classes, in this paper we propose to optimize a Conformer-based system to maximize performance on this scenario. For this purpose, we performed hyperparameter tuning and incorporated the recently proposed Frequency Dynamic Convolutions (FDY) to enhance its classification properties. Additionally, we employed our previously proposed multi-resolution approach, not only to improve performance but also to gain a deeper understanding of the Conformer architecture for SED, analyzing its advantages and disadvantages and identifying possible solutions. We also explored the integration of embeddings from the pre-trained model BEATs, an iterative framework for learning Bidirectional Encoder representations from Audio Transformers. By concatenating these embeddings into the input of the Conformer blocks, results were further improved, achieving a PSDS2 value of 0.813 and considerably outperforming SED systems based on CRNNs.
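The embedding-concatenation step described in the abstract can be illustrated with a minimal sketch: pretrained BEATs embeddings typically have a coarser time resolution than the encoder's feature maps, so they must be resampled along the time axis before being concatenated along the feature dimension. The function name, frame counts, and dimensionalities below are illustrative assumptions, not details taken from the paper's implementation:

```python
import numpy as np

def align_and_concat(features, embeddings):
    """Resample pretrained embeddings to the feature frame rate
    (nearest-neighbor along time) and concatenate on the feature axis.

    features:   (T_feat, D_feat) encoder features fed to Conformer blocks
    embeddings: (T_emb, D_emb)   frame-level pretrained embeddings
    returns:    (T_feat, D_feat + D_emb)
    """
    t_feat = features.shape[0]
    t_emb = embeddings.shape[0]
    # Map each feature frame to the nearest (floor) embedding frame.
    idx = np.floor(np.arange(t_feat) * t_emb / t_feat).astype(int)
    resampled = embeddings[idx]
    return np.concatenate([features, resampled], axis=-1)

# Hypothetical shapes: 156 encoder frames x 144 dims, 50 BEATs frames x 768 dims.
feats = np.random.randn(156, 144)
emb = np.random.randn(50, 768)
out = align_and_concat(feats, emb)
print(out.shape)  # (156, 912)
```

The combined tensor would then be passed to the Conformer blocks in place of the original features; the only architectural change needed is widening the input dimension of the first block by the embedding size.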
Journal description:
The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.