Yizhou Tan, Yanru Wu, Yuanbo Hou, Xin Xu, Hui Bu, Shengchen Li, Dick Botteldooren, Mark D. Plumbley
{"title":"探索音频事件识别中人类感知与模型推理之间的差异","authors":"Yizhou Tan, Yanru Wu, Yuanbo Hou, Xin Xu, Hui Bu, Shengchen Li, Dick Botteldooren, Mark D. Plumbley","doi":"arxiv-2409.06580","DOIUrl":null,"url":null,"abstract":"Audio Event Recognition (AER) traditionally focuses on detecting and\nidentifying audio events. Most existing AER models tend to detect all potential\nevents without considering their varying significance across different\ncontexts. This makes the AER results detected by existing models often have a\nlarge discrepancy with human auditory perception. Although this is a critical\nand significant issue, it has not been extensively studied by the Detection and\nClassification of Sound Scenes and Events (DCASE) community because solving it\nis time-consuming and labour-intensive. To address this issue, this paper\nintroduces the concept of semantic importance in AER, focusing on exploring the\ndifferences between human perception and model inference. This paper constructs\na Multi-Annotated Foreground Audio Event Recognition (MAFAR) dataset, which\ncomprises audio recordings labelled by 10 professional annotators. Through\nlabelling frequency and variance, the MAFAR dataset facilitates the\nquantification of semantic importance and analysis of human perception. By\ncomparing human annotations with the predictions of ensemble pre-trained\nmodels, this paper uncovers a significant gap between human perception and\nmodel inference in both semantic identification and existence detection of\naudio events. Experimental results reveal that human perception tends to ignore\nsubtle or trivial events in the event semantic identification, while model\ninference is easily affected by events with noises. Meanwhile, in event\nexistence detection, models are usually more sensitive than humans.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"164 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Exploring Differences between Human Perception and Model Inference in Audio Event Recognition\",\"authors\":\"Yizhou Tan, Yanru Wu, Yuanbo Hou, Xin Xu, Hui Bu, Shengchen Li, Dick Botteldooren, Mark D. Plumbley\",\"doi\":\"arxiv-2409.06580\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Audio Event Recognition (AER) traditionally focuses on detecting and\\nidentifying audio events. Most existing AER models tend to detect all potential\\nevents without considering their varying significance across different\\ncontexts. This makes the AER results detected by existing models often have a\\nlarge discrepancy with human auditory perception. Although this is a critical\\nand significant issue, it has not been extensively studied by the Detection and\\nClassification of Sound Scenes and Events (DCASE) community because solving it\\nis time-consuming and labour-intensive. To address this issue, this paper\\nintroduces the concept of semantic importance in AER, focusing on exploring the\\ndifferences between human perception and model inference. This paper constructs\\na Multi-Annotated Foreground Audio Event Recognition (MAFAR) dataset, which\\ncomprises audio recordings labelled by 10 professional annotators. Through\\nlabelling frequency and variance, the MAFAR dataset facilitates the\\nquantification of semantic importance and analysis of human perception. 
By\\ncomparing human annotations with the predictions of ensemble pre-trained\\nmodels, this paper uncovers a significant gap between human perception and\\nmodel inference in both semantic identification and existence detection of\\naudio events. Experimental results reveal that human perception tends to ignore\\nsubtle or trivial events in the event semantic identification, while model\\ninference is easily affected by events with noises. Meanwhile, in event\\nexistence detection, models are usually more sensitive than humans.\",\"PeriodicalId\":501284,\"journal\":{\"name\":\"arXiv - EE - Audio and Speech Processing\",\"volume\":\"164 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - EE - Audio and Speech Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.06580\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06580","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Exploring Differences between Human Perception and Model Inference in Audio Event Recognition
Audio Event Recognition (AER) traditionally focuses on detecting and identifying audio events. Most existing AER models tend to detect all potential events without considering their varying significance across different contexts, so their outputs often diverge substantially from human auditory perception. Although this is a critical issue, it has not been studied extensively by the Detection and Classification of Acoustic Scenes and Events (DCASE) community, because addressing it is time-consuming and labour-intensive. To address this issue, this paper introduces the concept of semantic importance in AER, focusing on exploring the differences between human perception and model inference. The paper constructs a Multi-Annotated Foreground Audio Event Recognition (MAFAR) dataset, which comprises audio recordings labelled by 10 professional annotators. Through labelling frequency and variance, the MAFAR dataset supports the quantification of semantic importance and the analysis of human perception.
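To make the quantification step concrete, the following is a minimal Python sketch of how labelling frequency and variance could be derived from multi-annotator labels. The data layout, function name, and Bernoulli-variance formula are illustrative assumptions, not the actual MAFAR release format or the paper's exact definition.

```python
# Hypothetical illustration: quantifying semantic importance from multi-annotator
# labels. The label format (one set of event tags per annotator per clip) is an
# assumption for illustration, not the actual MAFAR release format.
from collections import Counter

def semantic_importance(annotations, num_annotators=10):
    """Return per-event labelling frequency and variance for one audio clip.

    annotations: list of sets, one per annotator, each containing the event
                 labels that annotator marked as present in the clip.
    """
    counts = Counter(label for ann in annotations for label in ann)
    scores = {}
    for label, count in counts.items():
        freq = count / num_annotators      # labelling frequency in [0, 1]
        var = freq * (1 - freq)            # assumed Bernoulli variance across annotators
        scores[label] = {"frequency": freq, "variance": var}
    return scores

# Example: 10 annotators labelling one clip
clip_annotations = [
    {"dog_bark", "car_horn"}, {"dog_bark"}, {"dog_bark", "car_horn"},
    {"dog_bark"}, {"dog_bark", "birds"}, {"dog_bark"},
    {"dog_bark", "car_horn"}, {"dog_bark"}, {"dog_bark"}, {"dog_bark", "car_horn"},
]
print(semantic_importance(clip_annotations))
# dog_bark: frequency 1.0, variance 0.0  -> salient event, full agreement
# car_horn: frequency 0.4, variance 0.24 -> lower importance, higher disagreement
```

Under this reading, frequency reflects how important an event is to human listeners, while variance captures how much annotators disagree about it.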
By comparing human annotations with the predictions of an ensemble of pre-trained models, the paper uncovers a significant gap between human perception and model inference in both the semantic identification and the existence detection of audio events. Experimental results reveal that, in event semantic identification, human perception tends to ignore subtle or trivial events, while model inference is easily affected by noisy events. Meanwhile, in event existence detection, models are usually more sensitive than humans.
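As a companion sketch, the human-versus-model comparison in existence detection could be framed as set differences over thresholded labels: events flagged only by the model ensemble illustrate its higher sensitivity. The thresholds, dictionaries, and function below are hypothetical and are not taken from the paper's evaluation protocol.

```python
# Hypothetical illustration: contrasting human existence detection with an
# ensemble of pre-trained taggers. Thresholds and data layout are assumptions
# for illustration; the paper's actual protocol may differ.

def existence_gap(human_freq, model_probs, human_thresh=0.5, model_thresh=0.5):
    """Compare which events humans vs. the model ensemble consider present.

    human_freq:  dict label -> fraction of annotators who marked the event
    model_probs: dict label -> mean probability across the model ensemble
    """
    human_present = {l for l, f in human_freq.items() if f >= human_thresh}
    model_present = {l for l, p in model_probs.items() if p >= model_thresh}
    return {
        "agreed": human_present & model_present,
        "model_only": model_present - human_present,  # events only the ensemble flags
        "human_only": human_present - model_present,  # events the ensemble misses
    }

human_freq = {"dog_bark": 1.0, "car_horn": 0.4, "birds": 0.1}
model_probs = {"dog_bark": 0.95, "car_horn": 0.7, "birds": 0.6, "wind": 0.55}
print(existence_gap(human_freq, model_probs))
# "model_only" contains car_horn, birds and wind: events the ensemble reports
# but that most human annotators did not mark, i.e. higher model sensitivity.
```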