Yizhou Tan, Yanru Wu, Yuanbo Hou, Xin Xu, Hui Bu, Shengchen Li, Dick Botteldooren, Mark D. Plumbley
{"title":"探索音频事件识别中人类感知与模型推理之间的差异","authors":"Yizhou Tan, Yanru Wu, Yuanbo Hou, Xin Xu, Hui Bu, Shengchen Li, Dick Botteldooren, Mark D. Plumbley","doi":"arxiv-2409.06580","DOIUrl":null,"url":null,"abstract":"Audio Event Recognition (AER) traditionally focuses on detecting and\nidentifying audio events. Most existing AER models tend to detect all potential\nevents without considering their varying significance across different\ncontexts. This makes the AER results detected by existing models often have a\nlarge discrepancy with human auditory perception. Although this is a critical\nand significant issue, it has not been extensively studied by the Detection and\nClassification of Sound Scenes and Events (DCASE) community because solving it\nis time-consuming and labour-intensive. To address this issue, this paper\nintroduces the concept of semantic importance in AER, focusing on exploring the\ndifferences between human perception and model inference. This paper constructs\na Multi-Annotated Foreground Audio Event Recognition (MAFAR) dataset, which\ncomprises audio recordings labelled by 10 professional annotators. Through\nlabelling frequency and variance, the MAFAR dataset facilitates the\nquantification of semantic importance and analysis of human perception. By\ncomparing human annotations with the predictions of ensemble pre-trained\nmodels, this paper uncovers a significant gap between human perception and\nmodel inference in both semantic identification and existence detection of\naudio events. Experimental results reveal that human perception tends to ignore\nsubtle or trivial events in the event semantic identification, while model\ninference is easily affected by events with noises. Meanwhile, in event\nexistence detection, models are usually more sensitive than humans.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"164 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Exploring Differences between Human Perception and Model Inference in Audio Event Recognition\",\"authors\":\"Yizhou Tan, Yanru Wu, Yuanbo Hou, Xin Xu, Hui Bu, Shengchen Li, Dick Botteldooren, Mark D. Plumbley\",\"doi\":\"arxiv-2409.06580\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Audio Event Recognition (AER) traditionally focuses on detecting and\\nidentifying audio events. Most existing AER models tend to detect all potential\\nevents without considering their varying significance across different\\ncontexts. This makes the AER results detected by existing models often have a\\nlarge discrepancy with human auditory perception. Although this is a critical\\nand significant issue, it has not been extensively studied by the Detection and\\nClassification of Sound Scenes and Events (DCASE) community because solving it\\nis time-consuming and labour-intensive. To address this issue, this paper\\nintroduces the concept of semantic importance in AER, focusing on exploring the\\ndifferences between human perception and model inference. This paper constructs\\na Multi-Annotated Foreground Audio Event Recognition (MAFAR) dataset, which\\ncomprises audio recordings labelled by 10 professional annotators. Through\\nlabelling frequency and variance, the MAFAR dataset facilitates the\\nquantification of semantic importance and analysis of human perception. 
By\\ncomparing human annotations with the predictions of ensemble pre-trained\\nmodels, this paper uncovers a significant gap between human perception and\\nmodel inference in both semantic identification and existence detection of\\naudio events. Experimental results reveal that human perception tends to ignore\\nsubtle or trivial events in the event semantic identification, while model\\ninference is easily affected by events with noises. Meanwhile, in event\\nexistence detection, models are usually more sensitive than humans.\",\"PeriodicalId\":501284,\"journal\":{\"name\":\"arXiv - EE - Audio and Speech Processing\",\"volume\":\"164 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - EE - Audio and Speech Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.06580\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06580","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Exploring Differences between Human Perception and Model Inference in Audio Event Recognition
Audio Event Recognition (AER) traditionally focuses on detecting and identifying audio events. Most existing AER models tend to detect all potential events without considering their varying significance across different contexts, so their outputs often diverge substantially from human auditory perception. Although this is a critical issue, it has not been studied extensively by the Detection and Classification of Acoustic Scenes and Events (DCASE) community, because addressing it is time-consuming and labour-intensive. To address this issue, this paper introduces the concept of semantic importance in AER, focusing on exploring the differences between human perception and model inference. The paper constructs a Multi-Annotated Foreground Audio Event Recognition (MAFAR) dataset, which comprises audio recordings labelled by 10 professional annotators. Through labelling frequency and variance, the MAFAR dataset supports the quantification of semantic importance and the analysis of human perception.
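To make the quantification step concrete, the following is a minimal Python sketch of how labelling frequency and variance could be derived from multi-annotator labels. The data layout, function name, and Bernoulli-variance formula are illustrative assumptions, not the actual MAFAR release format or the paper's exact definition.

```python
# Hypothetical illustration: quantifying semantic importance from multi-annotator
# labels. The label format (one set of event tags per annotator per clip) is an
# assumption for illustration, not the actual MAFAR release format.
from collections import Counter

def semantic_importance(annotations, num_annotators=10):
    """Return per-event labelling frequency and variance for one audio clip.

    annotations: list of sets, one per annotator, each containing the event
                 labels that annotator marked as present in the clip.
    """
    counts = Counter(label for ann in annotations for label in ann)
    scores = {}
    for label, count in counts.items():
        freq = count / num_annotators      # labelling frequency in [0, 1]
        var = freq * (1 - freq)            # assumed Bernoulli variance across annotators
        scores[label] = {"frequency": freq, "variance": var}
    return scores

# Example: 10 annotators labelling one clip
clip_annotations = [
    {"dog_bark", "car_horn"}, {"dog_bark"}, {"dog_bark", "car_horn"},
    {"dog_bark"}, {"dog_bark", "birds"}, {"dog_bark"},
    {"dog_bark", "car_horn"}, {"dog_bark"}, {"dog_bark"}, {"dog_bark", "car_horn"},
]
print(semantic_importance(clip_annotations))
# dog_bark: frequency 1.0, variance 0.0  -> salient event, full agreement
# car_horn: frequency 0.4, variance 0.24 -> lower importance, higher disagreement
```

Under this reading, frequency reflects how important an event is to human listeners, while variance captures how much annotators disagree about it.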
By comparing human annotations with the predictions of an ensemble of pre-trained models, the paper uncovers a significant gap between human perception and model inference in both the semantic identification and the existence detection of audio events. Experimental results reveal that, in event semantic identification, human perception tends to ignore subtle or trivial events, while model inference is easily affected by noisy events. Meanwhile, in event existence detection, models are usually more sensitive than humans.
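As a companion sketch, the human-versus-model comparison in existence detection could be framed as set differences over thresholded labels: events flagged only by the model ensemble illustrate its higher sensitivity. The thresholds, dictionaries, and function below are hypothetical and are not taken from the paper's evaluation protocol.

```python
# Hypothetical illustration: contrasting human existence detection with an
# ensemble of pre-trained taggers. Thresholds and data layout are assumptions
# for illustration; the paper's actual protocol may differ.

def existence_gap(human_freq, model_probs, human_thresh=0.5, model_thresh=0.5):
    """Compare which events humans vs. the model ensemble consider present.

    human_freq:  dict label -> fraction of annotators who marked the event
    model_probs: dict label -> mean probability across the model ensemble
    """
    human_present = {l for l, f in human_freq.items() if f >= human_thresh}
    model_present = {l for l, p in model_probs.items() if p >= model_thresh}
    return {
        "agreed": human_present & model_present,
        "model_only": model_present - human_present,  # events only the ensemble flags
        "human_only": human_present - model_present,  # events the ensemble misses
    }

human_freq = {"dog_bark": 1.0, "car_horn": 0.4, "birds": 0.1}
model_probs = {"dog_bark": 0.95, "car_horn": 0.7, "birds": 0.6, "wind": 0.55}
print(existence_gap(human_freq, model_probs))
# "model_only" contains car_horn, birds and wind: events the ensemble reports
# but that most human annotators did not mark, i.e. higher model sensitivity.
```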