Zhor Diffallah, H. Ykhlef, Hafida Bouarfa, Nardjesse Diffallah
{"title":"Consistency Regularization-Based Polyphonic Audio Event Detection with Minimal Supervision","authors":"Zhor Diffallah, H. Ykhlef, Hafida Bouarfa, Nardjesse Diffallah","doi":"10.1109/STA56120.2022.10019247","DOIUrl":null,"url":null,"abstract":"Audio event detection refers to the task of specifying the nature of events happening in an audio stream, as well as locating these occurrences in time. Due to its wide applicability in a myriad of domains, this task has been gradually attracting interest over time. The development of the audio event detection task is largely dominated by modern deep learning techniques. Deep network architectures need a substantial amount of labeled audio clips that contain the start and end time of each event. However, collecting and annotating exhaustive datasets of audio recordings with the necessary information is both a costly and a laborious endeavour. To mend this, weakly-labeled semi-supervised learning methods have been adopted in an attempt to mitigate the labeling issue. In this work, we investigate the impact of incorporating weak labels and unlabeled clips into the training chain of audio event detectors. We have conducted our experiments on the Domestic Environment Sound Event Detection corpus (DESED); a large-scale heterogeneous dataset composed of several types of recordings and annotations. we have focused our study on methods based on consistency regularization; specifically: Mean Teacher and Interpolation Consistency Training. Our experimental results reveal that; with the proper parameterization, incorporating weakly-labeled and unlabeled data is beneficial for detecting polyphonic sound events.","PeriodicalId":430966,"journal":{"name":"2022 IEEE 21st international Ccnference on Sciences and Techniques of Automatic Control and Computer Engineering (STA)","volume":"105 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 21st international Ccnference on Sciences and Techniques of Automatic Control and Computer Engineering (STA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/STA56120.2022.10019247","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Audio event detection refers to the task of specifying the nature of events occurring in an audio stream, as well as locating these occurrences in time. Due to its wide applicability across a myriad of domains, this task has been gradually attracting interest over time. The development of audio event detection is largely dominated by modern deep learning techniques. Deep network architectures need a substantial amount of labeled audio clips annotated with the start and end time of each event. However, collecting and annotating exhaustive datasets of audio recordings with the necessary information is both a costly and laborious endeavour. To address this, weakly-labeled semi-supervised learning methods have been adopted in an attempt to mitigate the labeling issue. In this work, we investigate the impact of incorporating weak labels and unlabeled clips into the training chain of audio event detectors. We have conducted our experiments on the Domestic Environment Sound Event Detection (DESED) corpus, a large-scale heterogeneous dataset composed of several types of recordings and annotations. We have focused our study on methods based on consistency regularization, specifically Mean Teacher and Interpolation Consistency Training. Our experimental results reveal that, with proper parameterization, incorporating weakly-labeled and unlabeled data is beneficial for detecting polyphonic sound events.
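Both consistency-regularization methods named in the abstract pair a student network with an exponential-moving-average (EMA) teacher and penalize disagreement between their predictions on weakly-labeled and unlabeled clips. The paper's own implementation is not reproduced on this page; the following is a minimal, hypothetical PyTorch sketch of a Mean Teacher training step plus an Interpolation Consistency Training (ICT) term, assuming a CRNN-style model that returns (frame-level, clip-level) event probabilities, as is common in DESED baselines. All function and variable names are illustrative, not taken from the paper.

```python
# Hypothetical sketch (not the authors' code): Mean Teacher and ICT
# consistency terms for semi-supervised polyphonic sound event detection.
# Assumes `model(x)` returns (frame_probs, clip_probs) sigmoid outputs.
import copy
import torch
import torch.nn.functional as F

def make_teacher(student: torch.nn.Module) -> torch.nn.Module:
    # The teacher is a frozen copy of the student, updated only via EMA.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    # teacher <- alpha * teacher + (1 - alpha) * student, parameter-wise.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def ict_consistency(student, teacher, x_unlab, beta=0.5):
    # ICT: the student's prediction on a mixup of two unlabeled clips
    # should match the same mixup of the teacher's predictions.
    lam = torch.distributions.Beta(beta, beta).sample().item()
    perm = torch.randperm(x_unlab.size(0))
    x_mix = lam * x_unlab + (1.0 - lam) * x_unlab[perm]
    with torch.no_grad():
        _, clip_t = teacher(x_unlab)
        target = lam * clip_t + (1.0 - lam) * clip_t[perm]
    _, clip_s = student(x_mix)
    return F.mse_loss(clip_s, target)

def mean_teacher_step(student, teacher, optimizer,
                      x_strong, y_strong,  # frame-level (strong) labels
                      x_weak, y_weak,      # clip-level (weak) labels
                      x_unlab,             # unlabeled clips
                      cons_weight=1.0):
    # Supervised losses on the strongly and weakly labeled subsets.
    frame_s, _ = student(x_strong)
    _, clip_w = student(x_weak)
    sup_loss = (F.binary_cross_entropy(frame_s, y_strong)
                + F.binary_cross_entropy(clip_w, y_weak))

    # Mean Teacher consistency: the student on a noisy view of the
    # unlabeled clips should agree with the teacher on the clean originals.
    frame_u, clip_u = student(x_unlab + 0.05 * torch.randn_like(x_unlab))
    with torch.no_grad():
        frame_t, clip_t = teacher(x_unlab)
    cons_loss = F.mse_loss(frame_u, frame_t) + F.mse_loss(clip_u, clip_t)

    loss = sup_loss + cons_weight * cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)  # teacher tracks the student after each step
    return loss.item()
```

In published Mean Teacher setups, the consistency weight is typically ramped up from zero over the first training epochs so that early, unreliable teacher predictions do not dominate; the "proper parameterization" the abstract refers to plausibly covers choices such as this weight and the EMA decay, though the exact schedule used in the paper is not stated on this page.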